
A Survey of Recent Multidimensional Access Methods

Jayendra Venkateswaran, University of Missouri-Rolla

Abstract
Indexing spatial data has been a major area of research for the past two decades. Since no direct mapping exists from multi-dimensional space to one-dimensional space, researchers have developed many multidimensional access methods for the efficient processing of spatial information in large databases. This paper examines the various spatial indexes proposed in the literature and presents a taxonomy of these structures. Each structure is reviewed with a brief summary, a comparison with similar structures, its characteristics, and its algorithms for the various operations. Finally, a comparative analysis of all of these structures is presented.

1. Introduction
Spatial data is data associated with coordinates in single- or multi-dimensional spaces. Spatial database systems have been gaining importance in various industries and research areas over the past decade. Spatial data is used in many applications such as cartography, computer vision, robotics, and scientific and temporal databases. Spatial databases are collections of spatial objects such as points, lines and higher-dimensional objects. A spatial database system needs to integrate data obtained from various sources and in different ways, and must support the analysis and processing of the stored data whenever required. Due to the large volume of spatial data and their complex structures and relationships, spatial operations have become more expensive than conventional operators like join and select. The efficiency of the operations on a structure depends on its representation and on how fast the relevant data can be retrieved for a particular operation. Spatial data is multi-dimensional, so there is no linear ordering of spatial objects that preserves their spatial proximity [Gaed97]. Hence conventional methods such as the B-Tree [Baye72] or linear hashing cannot be applied directly to index spatial databases. Some proposals were made to store pre-computed spatial relationships ([Luha92] and [Rote91]), but these methods proved inefficient for large volumes of spatial objects, where spatial relationships must be determined dynamically during search and update operations. An index structure must support efficient spatial operations. As the dimensionality of the data space increases, the mathematical problems multiply, and some effects appear only in data spaces of higher dimensions. Overlap among data objects makes it impossible to partition a multi-attribute search space containing spatial data with simple node-splitting algorithms. The important attributes of a data region, volume and area, depend exponentially on the number of dimensions of the data space.
Hence, for index structures that are efficient in data spaces of fewer dimensions, performance degrades when they are extended to higher dimensions. This is known as the Curse of Dimensionality. For high-dimensional data spaces, [Bohn01] classifies the effects as follows:
Geometric effects: As the dimension increases, the volume of (hyper-)cubes and spheres increases. The volume of a cube in a d-dimensional space with edge length e is V = e^d.
Effects of partitioning: As the dimension increases, the space partitioning becomes coarser.
Database environment: The query distribution is affected as the dimensionality of the data space increases.
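These geometric effects are easy to reproduce numerically. The following sketch (the helper names are our own, not from [Bohn01]) computes the cube volume V = e^d and the fraction of a unit cube occupied by its inscribed ball, which vanishes rapidly as the dimension grows:

```python
import math

def cube_volume(e, d):
    """Volume of a d-dimensional hypercube with edge length e: V = e**d."""
    return e ** d

def unit_ball_volume(d):
    """Volume of the d-dimensional unit ball: pi**(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# The ball inscribed in the unit cube (radius 0.5) occupies a vanishing
# fraction of the cube as the dimension grows.
for d in (2, 4, 8, 16):
    fraction = unit_ball_volume(d) * 0.5 ** d / cube_volume(1.0, d)
    print(f"d={d:2d}  inscribed-ball fraction = {fraction:.6f}")
```

Already at d = 16 the inscribed ball covers well under one percent of the cube, which is one reason uniform partitions of high-dimensional spaces become ineffective.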

Figure A: Spatial Data Running Example

Figure A is an example of 2-dimensional spatial data which will be used as a running example for all the structures discussed in this paper. Lower-case letters label point objects and upper-case letters label objects with spatial extent. This paper is organized as follows. Section 2 presents the classification of multi-dimensional access methods. Section 3 gives a general overview of the structures and presents a taxonomy illustrating the evolution of the multi-dimensional access methods discussed in this paper. Each of these structures is explained in detail in Section 4. Section 5 compares the different characteristics of these structures and presents some of the experimental results obtained by various researchers. Finally, Section 6 concludes the paper with some insight into future work.

2. Classification
Multi-dimensional access methods can be classified into Point Access Methods (PAM) and Spatial Access Methods (SAM) [Gaed97]. PAMs are designed to operate on databases of spatial objects that have no spatial extension, whereas SAMs operate on spatial objects such as lines, polygons and higher-dimensional objects.

2.1 Point Access Method


Several classifications of PAMs under different categories can be found in [Same90] and [Gaed97]. In [Gaed97] they are classified into two categories: multi-dimensional hashing methods and hierarchical access methods. Multi-dimensional hashing methods such as the grid file ([Hinr83], [Sevc94]) use one-dimensional hashing to represent multidimensional objects, applying different heuristics to preserve the spatial proximity of the objects. Hierarchical methods like the Quadtree ([Bent74], [Garg82]), K-D-Tree [Bent75] and K-D-B-Tree [Robi81] use hierarchical data structures to store the point data. Space-filling curves ([Bial69], [Saga94], [Falo89], [Same90]) are often used to preserve spatial proximity during the linear ordering of the spatial objects; the UB-Tree [Baye97] uses z-ordering to map objects onto a one-dimensional sequence.
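To illustrate one such linear ordering, here is a minimal sketch of z-ordering (the Morton code) for two non-negative integer coordinates; the function name and bit width are our own choices, not part of the UB-Tree specification:

```python
def z_order(x, y, bits=16):
    """Interleave the bits of (x, y) into a single z-ordering key.
    Points that are close in 2-d space tend to get nearby keys."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x occupies even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y occupies odd bit positions
    return key

# The four cells of a 2x2 grid are visited in a "Z" pattern:
print([z_order(x, y) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]
```

Sorting points by this key gives the one-dimensional sequence that a conventional B-Tree can then index.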

2.2 Spatial Access Methods


These can be considered extensions of PAMs for processing objects with spatial extent. Based on the classification techniques proposed in [Lome92] and [Seeg88], spatial access methods can be classified under the following categories:
Transformation Methods: One approach is to map spatial objects to points in higher-dimensional spaces and store the points using an existing PAM, but this approach does not preserve spatial proximity as the dimensionality increases. Another approach is to use space-filling curves to map the objects of a higher-dimensional space to one-dimensional points, such as the Hilbert-R-Tree [Falo94], which uses the Hilbert curve to organize the data.
Overlapping Methods: These methods partition the data space hierarchically into smaller subspaces. Data objects are stored in the leaf nodes while the intermediate nodes improve the efficiency of search operations. A node may overlap with its sibling nodes, so multiple paths may have to be traversed while searching for an object. The R-Tree [Gutt84] and its variants R*-Tree [Krei90], SS-Tree [Jain96], X-Tree [Krei96] and SR-Tree [Kata97] are some of the promising methods.
Clipping Methods: Hierarchical data structures are used as in the overlapping methods, but objects are clipped to prevent overlap. This ensures that there exists only one path during the search for a data object. The R+-Tree [Sell87] is an example of this method.

The index structures can also be classified into two categories: data-organizing structures such as the R-Tree [Gutt84], and space-organizing structures such as the Quadtree [Bent74] and Grid File [Sevc94]. Several surveys such as [Ahn01], [Bohn01], [Gaed97] and [Proc97] provide background and analysis of these methods. Several other surveys analyze specific classes such as tree-based index structures ([Gunt91], [Brow98]) and structures for spatial information processing ([Same95], [Kuba01], [Gunt90], [Guti94], [Widm97], [Kolo90]). An analysis, performance evaluation and comparison of some of these structures can be found in [Jain95], [Gree89], [Roge98] and [Webe98].

3. Taxonomy
The basic issues to be addressed when designing index structures for spatial information processing are storage utilization and fast information retrieval. Other quantities to be minimized when designing spatial index structures are: the area of the regions of a node, the overlap between two regions, the number of objects duplicated to avoid overlaps, the directory size and the height of the tree. These factors determine the efficiency of many applications, and no straightforward solution is available that addresses all of them. Other factors such as buffer size and design strategy, space allocation and concurrency control methods can also affect the performance of spatial information processing. Based on their basic structure, the index structures can be classified into the following categories: tree-based index structures, hashing methods, methods based on space-filling curves, methods based on distance-based indexing, and signature methods. The evolution of the index structures discussed in this paper is given in Figure B.
Tree-based structures partition the space into a manageable number of smaller subspaces, which are partitioned further and so on. Early structures such as the Binary Tree [Knut73], K-D-Tree [Bent75], B-Tree [Baye72] and B+-Tree [Come79] were designed for data indexed on primary keys. Database applications, however, involve searching on one or more secondary keys, and for these applications multi-dimensional index structures were developed. At the lower level, all tree-based index structures are based on the K-D-Tree [Bent75], B-Tree [Baye72] and B+-Tree [Come79]. The K-D-B-Tree [Robi81] is based on the K-D-Tree and the B-Tree; it combines the advantages of these two structures and uses only a single attribute value as a boundary. The hB-Tree [Lome89, Lome90] and its variant the hBΠ-Tree [Lome97] are based on the K-D-B-Tree and use K-D-Trees [Bent75] to organize the space represented by the interior nodes for very efficient searching. Their goal is to avoid the downward cascading of splits, one of the drawbacks of the K-D-B-Tree. The hB-Tree chooses one or more existing boundaries as the boundary of the splitting index node. The resultant structure is fractal in nature, with an external enclosing region and cavities called extracted regions.

Figure B: Index Structures Taxonomy

The R-Tree [Gutt84] is one of the most important hierarchical structures for indexing spatial objects in high-dimensional spaces. It stores the minimum bounding rectangles (MBRs) of the objects. As R-Tree regions can overlap, more than one path may be traversed during search operations. The R+-Tree [Sell87] splits the rectangles that overlap to improve search performance, but this hurts its storage utilization because of the increased number of duplicates. The R*-Tree [Krei90] is a successful variant of the R-Tree. In addition to using criteria like margin, area and overlap, it uses the concept of forced reinsertion to reorganize the structure for better storage utilization. Another variant of the R-Tree, the Hilbert-R-Tree [Falo94], uses one of the space-filling curves ([Bial69], [Saga94]) to order the objects in the data set. As the dimensionality increases, the efficiency of the R*-Tree deteriorates due to the increased overlap in high-dimensional spaces. The X-Tree [Krei96] was designed for high-dimensional objects. It is an extension of the R*-Tree with an overlap-free split according to a split history, and with supernodes. A supernode is an oversized node which prevents overlap when an efficient split axis cannot be determined. The TV-Tree [Jaga94] improves the performance of the R*-Tree in high-dimensional spaces. It uses dimensionality reduction and a shift (telescoping) of the active dimensions. It is useful in

applications where the dimensions can be ordered by significance and where there exist feature vectors that allow a shift in dimensions. It is applicable to real data that are subject to the Karhunen-Loève transform. The SS-Tree [Jain96] is an index structure designed for similarity indexing of multi-dimensional data. It is an improvement of the R*-Tree, but uses bounding spheres instead of bounding rectangles and a modified forced reinsertion: unlike the R*-Tree, the SS-Tree reinserts entries only when the entries of the node have not been reinserted before. The SR-Tree [Kata97] can be regarded as a combination of the SS-Tree and the R*-Tree. It uses the intersection of the bounding sphere and the bounding rectangle, and hence outperforms both the SS-Tree and the R*-Tree, although the size of a directory entry is increased significantly by this approach. The G-Tree [Kuma94] combines the properties of the B-Tree [Baye72] and the Grid File [Sevc94]. It is a balanced index structure and divides the data space into non-overlapping regions, where the position of each node identifies its corresponding region directly. So even though its splitting procedure is more restrictive than that of the K-D-B-Tree [Robi81], the G-Tree has the advantage of higher storage utilization. The MB+-Tree [Yang95] is a multidimensional B+-Tree [Come79]. It partitions the data space into disjoint rectangular regions, like the G-Tree; the regions are ordered and the tree is balanced. The number of levels in the tree is reduced, thus reducing the search time. The PK-Tree [Wang98] combines the properties of the PR-Quadtree [Same90] and the K-D-Tree [Bent75] with the removal of unnecessary nodes. It shows better performance than methods such as the R-Tree, SR-Tree and X-Tree.
The Grid File ([Hinr83], [Sevc94], [Hinr85]) is a hashing-based access method which is a variation of the grid method. Its goal is to retrieve objects with at most two disk accesses. This is done by using a grid directory consisting of grid blocks; all records in one grid block are stored in the same bucket. [Beck92] and [Regn85] provide theoretical analyses of the Grid File and its variants. The Buddy Tree [Seeg90] is a tree-structured dynamic hashing method whose leaves point to the data pages. It uses k-d-tries [Oren82] to partition the universe. It avoids the downward splitting of the K-D-B-Tree [Robi81] and the overlapping problem of the R-Tree [Gutt84], and generalizes the buddy system of the Grid File [Sevc94] to organize correlated data efficiently. Space-filling curves ([Bial69], [Saga94], [Falo89], [Same90]) like z-ordering, Gray codes and the Hilbert curve map the multi-dimensional data space onto a one-dimensional data space; points that are close to each other in the data space are likely to be close in the embedded space. The UB-Tree [Baye97] is based on z-ordering, and the Hilbert-R-Tree [Falo94] on the Hilbert curve. The VP-Tree [Uhlm91] is a hierarchical structure where at each node of the tree one data object is selected as a vantage point, and the partitioning of the remaining objects is based on their distances from the selected object. The MVP-Tree [Bozk99] is an extension of the VP-Tree based on distance-based indexing techniques [Bozk97]; it uses multiple vantage points at each node of the tree, which increases the fanout and reduces the search time. The M-Tree [Ciac97] partitions objects according to their relative distance. The

distance function used depends on the application. Its goal is to reduce the search time as well as the number of distance computations. The Vector-Approximation File, or VA-File [Blot98], is based on the concept of signature methods ([Falo85], [Falo87]). Here the vector space is partitioned into cells, and these cells are used to generate a bit-encoded approximation of each vector; the VA-File is the flat array of these approximations. It overcomes the curse of dimensionality for spatial objects in high-dimensional spaces. The A-Tree [Saku00] is based on the concepts of the VA-File and the SR-Tree. It uses Virtual Bounding Rectangles (VBRs), which contain MBRs by approximating the data objects, and achieves better performance than both the VA-File and the SR-Tree.
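The bit-encoded approximation can be sketched as follows; the uniform quantization scheme and the function name are our own simplifications, not the exact encoding of [Blot98]:

```python
def va_approximation(vector, bits_per_dim=2, lo=0.0, hi=1.0):
    """Map each coordinate to a small cell number and pack the cell numbers
    into one integer -- a single bit-encoded VA-File approximation entry."""
    cells = 1 << bits_per_dim                  # cells per dimension
    approx = 0
    for x in vector:
        cell = min(int((x - lo) / (hi - lo) * cells), cells - 1)
        approx = (approx << bits_per_dim) | cell
    return approx

# Two nearby vectors usually share the same compact approximation, so a scan
# of the VA-File can filter out most candidates before touching full vectors.
print(va_approximation([0.1, 0.9]), va_approximation([0.12, 0.88]))  # 3 3
```

A query scans the small flat array of approximations first and fetches full vectors only for the cells that cannot be pruned.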

4. Multi-dimensional Access Methods

4.1 Tree-based Index Structures


The binary search tree is a basic data structure for representing data objects whose index values are in a linear ordering. The concept of partitioning the data space recursively has been adopted and generalized in many sophisticated index structures. Early structures such as the Binary Tree [Knut73], K-D-Tree [Bent75], B-Tree [Baye72] and B+-Tree [Come79] were designed around the primary key of the data objects. The K-D-Tree, B-Tree and B+-Tree are the basic structures from which all tree-based index structures were developed. The K-D-B-Tree [Robi81] is based on the K-D-Tree and the B-Tree; it combines the advantages of these two structures and uses only a single attribute value as a boundary. In this section, we examine indexes based on these structures.

4.1.1 R-Tree
Indexing methods such as ISAM, the B-Tree and its variants can index only one-dimensional data points, but spatial data cover areas in multi-dimensional spaces and are not represented well by point data. The R-Tree [Gutt84] is a multi-dimensional generalization of the B-Tree: a hierarchical data structure for indexing spatial objects in high-dimensional spaces. Like the B-Tree and B+-Tree, the R-Tree is a height-balanced tree with index records in its leaf nodes containing pointers to data objects, and it ensures efficient storage utilization. Nodes correspond to disk pages, and the maximum fanout is determined by the size of the disk page on which the tree is stored. The structure is dynamic: insertions and deletions can be intermixed with searches, and no periodic reorganization is required.

4.1.1.1 R-Tree Vs B+-Tree


R-Trees store the minimum bounding rectangle (MBR) of each object. Searching may involve traversing more than one subtree because the MBRs can overlap. The structure of the R-Tree is based on the spatial location of the objects, represented as intervals in several dimensions. Search queries are given as minimum bounding rectangles, and the R-Tree uses area as the parameter for choosing the insertion path.

Fig 1a: R-Tree

Figure 1b: R-Tree Structure

The R-Tree structure for the running example is shown in Figure 1a. Dots represent the point objects and solid rectangles represent the bounding regions of the spatial objects. Dotted lines represent the MBRs in the leaf and intermediate nodes of the R-Tree. The data space is comprised of MBRs R1 and R2, which in turn contain MBRs R3, R4, R5 and R6, R7, R8, respectively. Figure 1b gives the tree structure of this partition; the intermediate regions contain the point and spatial objects. A given R-Tree structure is not unique: it depends on the order of insertion of the data objects. For search operations, more than one path may need to be traversed. For example, to search for the point object g, both the path R1 → R5 → g and the subtree of R2 have to be traversed, even though the object is present in R1.

4.1.1.2 Salient Features


The R-Tree uses the minimum bounding rectangle (MBR) of an object as its bounding box; the MBR is the smallest rectangle containing the object. Entries in the leaf nodes are of the form [MBR, Record_Pointer] and entries in non-leaf nodes are of the form [MBR, Child_Pointer]. Let M be the maximum number of entries possible in a node, let m ≤ M/2 be the minimum number of entries in a node, and let d be the number of dimensions. The lower bound m prevents the degeneration of trees and ensures efficient storage utilization: if the number of entries in a node falls below m, the node is deleted and the rest of its entries are distributed among the sibling nodes. The upper bound M is determined by the size of a disk page. By storing the bounding boxes of geometric objects such as points, polygons or more complex objects, R-Trees can be used to determine which objects intersect a given query region. The R-Tree has the following properties: Every node except the root has between m and M index entries. For each entry in a leaf node, the MBR is the smallest rectangle that spatially contains the d-dimensional object pointed to by the Record_Pointer. For each entry in a non-leaf node, the MBR is the smallest rectangle that spatially contains the objects in the subtree pointed to by the Child_Pointer. The root node has at least two children unless it is a leaf. All leaves appear at the same level.
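The MBR and its two basic tests can be sketched in a few lines (2-d only; the class and method names are illustrative, not from [Gutt84]):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MBR:
    """Axis-aligned minimum bounding rectangle in 2-d."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self):
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)

    def intersects(self, other):
        """True when the two rectangles share at least one point."""
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

    def union(self, other):
        """Smallest rectangle covering both -- used when adjusting parents."""
        return MBR(min(self.xmin, other.xmin), min(self.ymin, other.ymin),
                   max(self.xmax, other.xmax), max(self.ymax, other.ymax))
```

`intersects` drives the search algorithm below, and `union` is what the insertion algorithm uses when adjusting parent MBRs after an insert or split.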

As each node has at least m entries, the height of an R-Tree with N objects is at most ⌈log_m N⌉ − 1. The maximum number of nodes is ⌈N/m⌉ + ⌈N/m²⌉ + … + 1, and the worst-case space utilization for all nodes except the root is m/M. Nodes will tend to have more than m entries, which decreases the height of the tree and improves space utilization.
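These bounds can be checked numerically with a small sketch (the helper names are ours), counting the root as height 0:

```python
import math

def max_height(n, m):
    """Upper bound on R-Tree height for n objects and minimum fanout m:
    ceil(log_m n) - 1, with the root at height 0."""
    return math.ceil(math.log(n, m)) - 1

def max_nodes(n, m):
    """Worst-case node count: ceil(n/m) + ceil(n/m**2) + ... + 1 (the root)."""
    total, level = 0, math.ceil(n / m)
    while level > 1:
        total += level
        level = math.ceil(level / m)
    return total + 1
```

For example, 100 objects with m = 5 give at most 20 + 4 + 1 = 25 nodes and a height of at most 2.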

4.1.1.3 Operations
Search: In an R-Tree, all data objects that overlap with the query region are examined to retrieve the objects in the query region. When a node is searched, more than one subtree may need to be traversed; hence it is not possible to guarantee good worst-case performance. With efficient update algorithms, however, the tree can be maintained in a form that eliminates the irrelevant regions of the indexed space and examines only data near the search area. The search algorithm descends the tree, at each level selecting the entries whose MBRs overlap with that of the query. When a leaf node is reached, the entries whose MBRs overlap with the MBR of the query object are selected.

Search(N,Q)
Input: Let N be the root of the R-Tree and Q be the query region given by the user.
Output: All objects in the query region Q.

Step 1: If N is a leaf node, for each entry E whose E.MBR intersects the query region Q, the object pointed to by E.Record_Pointer is retrieved.
Step 2: If N is a non-leaf node, for each entry E whose E.MBR intersects Q, Search(E.Child_Pointer, Q) is invoked.

Insertion: Inserting an object into an R-Tree involves inserting its MBR into the R-Tree along with a reference to the object. Only one path of the tree is traversed and the new entry is inserted at a leaf node. The insertion algorithm descends the tree, at each intermediate node selecting the entry that requires the least enlargement to include the new object; the object is then inserted into the leaf node. Two heuristics have to be defined to handle the insert operation: the choice of a suitable region to insert into, and the management of overflows. An overflow occurs when the number of entries in a node becomes greater than M after inserting the new entry; overflows are generally handled by splitting the node. The enlargement of an entry E is Area-Enlargement(E) = Area(E including the object to be inserted) − Area(E without the new object). If there is no overlap, the insertion algorithm finds a leaf node in O(log_m N) time.

Insert(N,O)
Input: Let N be the root of the R-Tree and O be the object to be inserted.
Output: New R-Tree with O inserted.
Step 1: If N is a non-leaf node, the entry E whose MBR needs the least enlargement to include O is selected; ties are resolved by selecting the one with the smallest area, and then Insert(E.Child_Pointer,O) is invoked.
Step 2: If N is a leaf node, O is inserted into N. If N then has more than M entries, Split(N) is invoked to get two nodes N and N′.
Step 3: If P is the parent node of N with entry E pointing to N, then E.MBR is adjusted to contain all objects in N.
Step 4: If there was a split resulting in a new node N′, an entry E′ with E′.MBR containing the objects in N′ and E′.Child_Pointer pointing to N′ is created and inserted into P. If P overflows, Split(P) is invoked to get P and P′.
Step 5: Let N = P and N′ = P′ in case of a split. If N is not the root node, the process is repeated from Step 3.
Step 6: If the root is split, a new root is created with entries E and E′ pointing to N and N′.
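Step 1 of Insert (choosing the subtree) can be sketched as follows, with rectangles as (xmin, ymin, xmax, ymax) tuples; the helper names are our own:

```python
def area(r):
    """Area of an (xmin, ymin, xmax, ymax) rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])

def union(r, s):
    """Smallest rectangle enclosing both r and s."""
    return (min(r[0], s[0]), min(r[1], s[1]), max(r[2], s[2]), max(r[3], s[3]))

def choose_entry(rects, obj):
    """Pick the rectangle needing the least area enlargement to cover obj;
    ties go to the rectangle with the smaller area (Insert, Step 1)."""
    return min(rects, key=lambda r: (area(union(r, obj)) - area(r), area(r)))
```

Sorting the candidates by the (enlargement, area) pair implements both the primary criterion and the tie-break in one comparison.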


Deletion: Deletion in an R-Tree requires an exact-match query for the object. Due to the possible overlaps, deletion is not a local operation. The deletion algorithm searches the R-Tree to find the leaf node containing the object to be deleted, and the entry is removed from that leaf node. If there is an underflow due to the deletion, the entries of the node are stored in an array. (An underflow occurs when the number of entries in a node becomes less than m after deleting an entry.) This process is repeated recursively on the parents of the node until the root is reached; then the entries stored in the array are re-inserted into the tree. The orphaned entries are merged with the sibling entries and the intervals are adjusted. The principle of reinsertion is that an entry must be reinserted at the same level from which it was deleted.

Delete(N,O)
Input: R-Tree rooted at N and object to be deleted O.
Output: New R-Tree with O removed.
Step 1: Search(N,O.MBR) is invoked to get the leaf node L which contains O. If L is not found, the process is terminated.
Step 2: O is removed from L.
Step 3: If L is the root, the entries of L are added to Q and Step 8 is executed.
Step 4: If L is a non-root node, let P be the parent of L and E be its entry in P.
Step 5: If L has fewer than m entries, E is removed from P and the entries of L are added to Q.
Step 6: If L has at least m entries, E.MBR is adjusted in P.
Step 7: The process is repeated for the parent node P from Step 3.
Step 8: For each entry E in Q, Insert(N, E) is invoked.

Node-split: A node-split operation is performed when the entries of a node overflow. Node splitting should minimize the chance that subsequent searches must examine both new nodes. To insert a new entry into a node of M entries, the collection of M+1 entries must be divided between two nodes so that the total area of the two covering rectangles is minimized. The split is then propagated up the tree.
Three algorithms can be used: the exhaustive algorithm, the quadratic-cost algorithm and the linear-cost algorithm.
Exhaustive algorithm: All possible groupings are generated and the one with minimum total area is selected. There are approximately 2^(M−1) groupings; generating all of them and selecting the one with minimum area takes exponential time, resulting in a very high CPU cost.


Quadratic-split algorithm: This scheme picks two of the (M+1) entries to be the first elements of the two new groups by choosing the pair that would waste the most area if both were placed in the same group. Each of the remaining entries is then added to the group that requires the least area enlargement; ties are resolved by selecting the group with the smaller area and then the one with fewer entries.
Algorithm:
Input: Node N to be split
Output: Nodes N and N′
Step 1: For each pair of entries E1 and E2, with J the smallest rectangle enclosing both, d = Area(J) − Area(E1) − Area(E2) is calculated.
Step 2: The pair with the largest value of d is selected, and its entries are assigned as the first elements of the two nodes N and N′.
Step 3: If all the entries have been assigned, the process stops.
Step 4: If one node has so few entries that it requires all of the remaining entries, the remaining entries are assigned to that node and the process is terminated.
Step 5: For each unassigned entry E, d1 and d2, the area increases required by N and N′ respectively to include E, are calculated.
Step 6: The entry E with the largest difference between d1 and d2 is selected as the next entry to be assigned.
Step 7: E is added to the node that requires the least enlargement to include it. Ties are resolved by the criteria of smaller area and then fewer entries.
Step 8: The process is repeated from Step 3.

Linear-cost algorithm: This is similar to the quadratic algorithm except that it selects the first entry of each new group in linear time.
Algorithm:
Input: Node N to be split
Output: Nodes N and N′
Step 1: Along each dimension, the two entries whose rectangles have the highest low side and the lowest high side are found, and their separation is recorded.


Step 2: The separations are normalized by dividing by the width of the entire set along the corresponding dimension.
Step 3: The pair with the greatest normalized separation along any dimension is selected.

The R-Tree is completely dynamic: insertions and deletions can be intermixed with queries, and no periodic global reorganization is required. It is based on the heuristic optimization of the area of the enclosing rectangles in the inner nodes. Since the structure allows the bounding rectangles of different entries to overlap one another, the search algorithm may have to traverse more than one path to find the desired data, so minimizing the overlap between sibling nodes is an important issue for search performance in an R-Tree. The R-Tree was proposed to index data objects of non-zero size in high-dimensional spaces, but its structure can easily be adapted to indexing multidimensional points with small modifications to its insertion and search algorithms.
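The seed selection of the quadratic and linear split algorithms described above can be sketched as follows (2-d rectangles as (xmin, ymin, xmax, ymax) tuples; the function names are ours):

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(r, s):
    return (min(r[0], s[0]), min(r[1], s[1]), max(r[2], s[2]), max(r[3], s[3]))

def quadratic_seeds(entries):
    """Quadratic split, Steps 1-2: the pair whose enclosing rectangle J wastes
    the most area, d = area(J) - area(E1) - area(E2), seeds the two nodes."""
    return max(combinations(entries, 2),
               key=lambda p: area(union(*p)) - area(p[0]) - area(p[1]))

def linear_seeds(entries):
    """Linear split, Steps 1-3: per dimension, find the rectangle with the
    highest low side and the one with the lowest high side, normalize their
    separation by the total width, and seed with the most-separated pair."""
    best = None
    for lo in (0, 1):                       # x-dimension, then y-dimension
        hi = lo + 2
        highest_low = max(entries, key=lambda e: e[lo])
        lowest_high = min(entries, key=lambda e: e[hi])
        width = max(e[hi] for e in entries) - min(e[lo] for e in entries)
        sep = (highest_low[lo] - lowest_high[hi]) / width if width else 0.0
        if best is None or sep > best[0]:
            best = (sep, highest_low, lowest_high)
    return best[1], best[2]
```

The quadratic version examines all O(M²) pairs, while the linear version makes one pass per dimension, matching the cost difference described in the text.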

4.1.2 R+-Tree
Efficient R-Tree [Gutt84] search requires minimal coverage and overlap. Since it is very hard to control overlap during the dynamic splits of R-Trees, search performance may degrade from logarithmic to linear. The R+-Tree [Sell87] is an overlap-free variant of the R-Tree, a compromise between the R-Tree and the K-D-B-Tree [Robi81]. The R+-Tree was proposed to overcome the problem of overlapping covering rectangles in the internal nodes of the R-Tree. It follows the idea that if partitions are allowed to split rectangles, then zero overlap among intermediate nodes can be achieved. Avoiding overlap increases the height of the tree, but has the benefit of multiple shorter paths.

4.1.2.1 R+-Tree Vs R-Tree and K-D-B-Tree

If M is the maximum number of entries in a node, the R+-Tree does not have the property that the number of entries must be between M/2 and M. To guarantee overlap-free regions, the split algorithm uses a forced-split strategy. Due to forced splits, the height of an R+-Tree is greater than that of an R-Tree, and a split in an R+-Tree has to be propagated upwards to the parent nodes and also down to the child nodes. In an R+-Tree, an input rectangle may be added to more than one leaf node. As nodes can overlap in an R-Tree, its searches are more expensive; although R-Trees achieve better space utilization at the expense of search performance, a roughly 10% degradation in storage is a minimal price for the search improvement obtained in the R+-Tree. The R+-Tree also has reduced space coverage compared to the K-D-B-Tree.


Figure 2a: R+-Tree

Fig 2b: R+-Tree Structure

Duplication of the object identifiers results in an overlap-free R+-Tree structure. Figure 2 shows the R+-Tree structure for the running example. In cases of overlap, objects like D are clipped and represented as entries in both MBRs R5 and R8. For searching the object g, the R+-Tree has only one search path, R1 → R5 → g, whereas two search paths exist in the R-Tree.
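The clipping that produces the duplicated entries for D can be sketched as follows; `clip` is an illustrative helper, not a routine from [Sell87]:

```python
def clip(rect, region):
    """Intersect an object's MBR with one of the disjoint node regions.
    The R+-Tree stores one clipped entry per region the object touches."""
    xmin, ymin = max(rect[0], region[0]), max(rect[1], region[1])
    xmax, ymax = min(rect[2], region[2]), min(rect[3], region[3])
    if xmin >= xmax or ymin >= ymax:
        return None                          # no overlap with this region
    return (xmin, ymin, xmax, ymax)

# An object straddling two regions yields two clipped entries (like D above):
regions = [(0, 0, 5, 10), (5, 0, 10, 10)]
obj = (3, 2, 7, 4)
print([clip(obj, r) for r in regions])   # [(3, 2, 5, 4), (5, 2, 7, 4)]
```

Because the regions are disjoint, any query rectangle overlaps a unique set of regions and each object is reached by exactly one path per region it occupies.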

4.1.2.2 Salient Features

A leaf node entry is of the form (Record_Pointer, MBR), where MBR is the minimum bounding rectangle of the data object. An intermediate node entry is of the form (Child_Pointer, MBR). As nodes are populated as much as possible, the height of the tree is decreased at the expense of costly updates.


It has the following properties: For each entry of the form (Child_Pointer, MBR) in an intermediate node, the subtree rooted at the node pointed to by Child_Pointer contains a rectangle R if and only if R is covered by MBR. The overlap between any two entries of an intermediate node is zero. The root has at least two children unless it is a leaf. All leaves are at the same level.

4.1.2.3 Operations

Search: The search algorithm is similar to that of the R-Tree; the main difference is that the regions are non-overlapping. The search space is decomposed into disjoint subregions, and for each of these the tree is traversed until the data objects in the query region are found in the leaves.

Search(N,Q)
Input: R+-Tree rooted at N and query region Q.
Output: Objects in the query region Q.
Step 1: If N is a Non-Leaf node, for each entry E whose MBR overlaps with Q, Search(E.Child_Pointer, Q) is invoked.
Step 2: If N is a Leaf node, for each entry E that intersects Q, retrieve the object pointed to by E.Object_Pointer.

Insertion: During insertion three cases have to be considered [Gunt88, Ooib88]. The first case is when the covering rectangles of all entries do not intersect the object to be inserted, and the second is when the object partially intersects the rectangles of several entries. These cases should be handled so that both the duplication of objects and the area of the bounding rectangles are minimized. The third and most important case is when there is some dead space in a node that cannot be covered by the rectangles in the node. According to [Gutt84], this requires one or more of the covering rectangles to be split, which in turn forces entries in the child nodes to be split and degrades the storage efficiency of the structure. Insertion involves traversing the tree and adding the object to a leaf node. The object may be added to more than one leaf node, as it may be broken into sub-regions along existing partitions. When nodes are split during overflows, the split is propagated to parent as well as child nodes, similar to the downward split of the K-D-B-Tree. The insertion algorithm traverses the tree until the leaf nodes are reached. At each intermediate node, the entries whose MBRs overlap with that of the object to be inserted are traversed. Once a leaf node is reached, the object is inserted. If the insertion causes an overflow, the node is split and the split is propagated to re-organize the tree.
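The disjoint decomposition described above can be sketched in a few lines. This is an illustrative toy model, not the authors' code: the node layout (a dict with 'leaf' and 'entries' fields) and rectangle tuples are assumptions. Because an R+-Tree clips objects across disjoint subregions, the same object id can appear in several leaves, so the result set is deduplicated.

```python
# Minimal sketch of R+-Tree window search. A node is assumed to be
# {'leaf': bool, 'entries': [(rect, child_or_object_id), ...]} and a
# rectangle is (xmin, ymin, xmax, ymax). All names are illustrative.

def intersects(a, b):
    """True if rectangles a and b overlap (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rplus_search(node, query, found=None):
    """Collect object ids whose rectangle intersects the query window.
    Results are deduplicated because clipped objects are stored in
    more than one leaf."""
    if found is None:
        found = set()
    for rect, ref in node['entries']:
        if intersects(rect, query):
            if node['leaf']:
                found.add(ref)                    # ref is an object id
            else:
                rplus_search(ref, query, found)   # ref is a child node
    return found

# Two leaves over disjoint regions; object 'b' is clipped into both.
leaf1 = {'leaf': True, 'entries': [((0, 0, 2, 2), 'a'), ((1, 1, 3, 3), 'b')]}
leaf2 = {'leaf': True, 'entries': [((4, 0, 6, 2), 'c'), ((1, 1, 3, 3), 'b')]}
root = {'leaf': False, 'entries': [((0, 0, 3, 3), leaf1), ((3, 0, 6, 3), leaf2)]}
```

Note that a query touching both subregions descends into both children, while a small query confined to one subregion follows a single path.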


Insert(N,O)
Input: R+-Tree rooted at N and object O to be inserted.
Output: New R+-Tree with O inserted.
Step 1: If N is a Non-Leaf node, for each entry E whose E.MBR overlaps with O.MBR, Insert(E.Child_Pointer, O) is invoked.
Step 2: If N is a Non-Leaf node and O does not overlap with any entry, select the entry E whose MBR needs the least enlargement to include O, resolving ties by selecting the one with the smallest area, and invoke Insert(E.Child_Pointer, O).
Step 3: If N is a Leaf node, insert O into N. If N has more than M entries, Split(N) is called, which re-organizes the tree.
Step 4: If there is no overflow, the MBRs of the entries along the path are adjusted.

Deletion: First the objects to be deleted are located and then removed from the leaf nodes. More than one entry may have to be removed, since the insertion routine may have added entries to more than one leaf node. When nodes become under-utilized due to many deletions, periodic re-organization is required: the entries in the under-utilized nodes are re-inserted from the top of the tree. During deletion the tree is traversed until the leaf node is reached. At each intermediate node, the entries whose MBRs overlap with that of the object are selected for traversal. At the leaf node, the entry is removed and the parent rectangle is adjusted to enclose the remaining children rectangles.

Delete(N,O)
Input: R+-Tree rooted at N and object O to be deleted.
Output: New R+-Tree with O removed.
Step 1: If N is a Non-Leaf node, for each entry E such that E.MBR intersects O.MBR, Delete(E.Child_Pointer, O) is invoked.
Step 2: If N is a Leaf node, O is removed from N and N.MBR is adjusted to enclose the remaining entries in N.
Step 3: The MBRs of all the entries along the path are adjusted.

Node-Split: During an insertion, when a leaf node overflows and a split is required, the split attempts to reduce identifier duplication. As in the K-D-B-Tree, the split of a leaf node may propagate upwards to the root of the tree, and the split of an intermediate node may require downward propagation to the leaf nodes. The split algorithm finds a partition for the node to be split, creates two new nodes and, if needed, propagates the split.
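The downward propagation rests on clipping a rectangle along the chosen partition line. A minimal sketch of that clipping step, with the rectangle representation and function name being assumptions for illustration:

```python
def split_rect(rect, axis, value):
    """Clip rect = (xmin, ymin, xmax, ymax) by a partition line.
    axis 0 splits on x, axis 1 on y. Returns (low_part, high_part),
    either of which may be None if rect lies wholly on one side."""
    lo, hi = rect[axis], rect[axis + 2]
    if hi <= value:
        return rect, None
    if lo >= value:
        return None, rect
    left, right = list(rect), list(rect)
    left[axis + 2] = value   # piece below/left of the cut
    right[axis] = value      # piece above/right of the cut
    return tuple(left), tuple(right)
```

A rectangle straddling the cut yields two pieces, one per subregion; a rectangle entirely on one side is passed through unchanged, which is why only straddling entries are duplicated in the tree.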


The criteria used during partitioning are:
- nearest neighbors,
- minimal total x- and y-displacement,
- minimal total space coverage accrued by the two sub-regions, and
- minimal number of rectangle splits.
The first three reduce search cost by reducing the area of dead space, and the fourth controls the growth of the tree's height. All four criteria cannot possibly be satisfied at the same time. In the algorithm the rectangles are sorted along all dimensions, so the complexity is of order N log N. A sweep routine is used to scan the rectangles and identify points where space partitioning is possible. The fill-factor determines how densely the tree is populated: the more packed the tree, the faster the search. So if the database is static it is desirable to pack the tree to capacity.

Split(N)
Input: Node N of the R+-Tree to be split and fill-factor f.
Output: New re-organized R+-Tree after splitting N.
Step 1: Let Lx and Ly be the lowest x- and y-coordinates of the entries in N.
Step 2: Along each axis, the cost and location of the cut are determined from Steps 3 and 4.
Step 3: Starting from the lowest value along the dimension, the first f rectangles from the rectangles sorted along the axis are selected.
Step 4: The cost of organizing the rectangles along the axis is computed based on minimal splits, minimal coverage and the other properties.
Step 5: The axis which gives the smallest cost is selected as the axis along which the space is divided into two regions T1 and T2.
Step 6: The entries completely enclosed by region T1 are added to node N1 and those completely enclosed by T2 are added to node N2.
Step 7: If N is a Leaf node and an entry overlaps both T1 and T2, it is added to both N1 and N2.
Step 8: If N is a Non-Leaf node and an entry overlaps both T1 and T2, the children nodes are split recursively along the partition axis by invoking Split().
Step 9: If N is the root node, a new root is created with the two new nodes as children.
Step 10: If N is not the root node, N's entry in its parent node P is replaced by entries for N1 and N2. If P overflows, Split(P) is invoked.
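The sweep-based choice of a cut point might be sketched as below, under the simplifying assumption that the candidate cut is placed at the high edge of the f-th rectangle along the sorted axis, and that the "cost" reported is just the number of rectangles the cut would split; the names are illustrative.

```python
def choose_cut(rects, axis, f):
    """Sweep sketch: sort rectangles (xmin, ymin, xmax, ymax) by their
    low coordinate on `axis`, place a candidate cut at the high edge of
    the f-th rectangle, and report how many rectangles that cut would
    have to split (one component of the partition cost)."""
    rects = sorted(rects, key=lambda r: r[axis])
    cut = rects[f - 1][axis + 2]          # high edge of the f-th rect
    splits = sum(1 for r in rects if r[axis] < cut < r[axis + 2])
    return cut, splits
```

A full implementation would evaluate such candidate cuts along every axis and pick the one minimizing the combined criteria listed above.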


The R+-Tree is a variant of the R-Tree without any overlap among the intermediate nodes. It requires more space than the R-Tree, as it may add entries to more than one leaf node. The pack algorithm attempts to set up an R+-Tree with good search performance. The performance of the R+-Tree is immune to changes in the distribution of segment sizes, but when the number of large segments approaches the total number of segments the R+-Tree suffers from many splits into sub-regions. Its main advantage is improved search performance, especially for point queries. It behaves like a K-D-B-Tree when the data consists of points.

4.1.3 R*-Tree
Minimizing both coverage and overlap is difficult when optimizing the performance of the R-Tree. The R*-Tree [Krei90] is one of the most successful variants of the R-Tree. It is based on a careful study of the R-Tree algorithms under various data distributions and has the same structure as the R-Tree. The R*-Tree introduces the margin of the covering rectangles as an additional optimization criterion. This criterion is based on the observation that clustering rectangles with little variance in edge lengths tends to reduce the area of the cluster's covering rectangle. It is a dynamic structure, as insertions and deletions can be performed without periodic reorganization. As in the R-Tree, multiple paths may have to be traversed for search operations. The R*-Tree is an efficient structure for both point and spatial objects. It introduces the concept of Forced Re-Insertion, which forces part of the entries to be reinserted during insertion and helps achieve dynamic reorganization of the structure.

4.1.3.1 R*-Tree Vs R-Tree

The R*-Tree, in addition to the area criterion of the R-Tree, uses the margin and overlap of each enclosing rectangle. Minimizing area reduces the dead space and improves search performance. Storage utilization is improved by minimizing the height of the tree. It uses the concept of Forced Re-insert when there is an overflow. Advantages of the R*-Tree over the R-Tree: During insertion, the algorithm follows nodes whose MBR has the minimum increase in overlap, so search performance is improved. Whenever there is an overflow the node is not immediately split; instead some entries are deleted and re-inserted into the tree, often ending up in sibling nodes. The entries chosen for re-insertion are those at maximum distance from the center of the node's MBR. This feature increases storage utilization and improves the quality of the partition, making it almost independent of the sequence of insertions. The split algorithm first selects the axis along which the split will be made, and then sorts the projections of the MBRs along the split axis by their lower coordinates. Different distributions are formed, and the one resulting in minimum overlap between the MBRs is selected. This algorithm achieves a better-quality partition of the MBRs over the tree.


The structures of the R-Tree and R+-Tree depend on the order of insertion of the data objects. The R*-Tree uses the concept of Forced Re-Insertion, which helps an object find a node where it can be inserted with improved storage and search performance. The amount of overlap is minimized compared to the R-Tree, and identifier duplication as in the R+-Tree [Sell87] is avoided.

Figure 3a: R*-Tree

Figure 3b: R*-Tree Structure

Figure 3 illustrates the R*-Tree structure for the running example. The data space is partitioned into three MBRs R1, R2 and R3. It can be seen that the R*-Tree structure has fewer overlaps and no duplicates as compared to the R-Tree and R+-Tree. The search path for an exact match query to find the object g is R2 → R7 → g.

4.1.3.2 Salient Features

The R*-Tree has the same structure as the R-Tree. The leaf nodes are of the form (MBR, Record_Pointer) and non-leaf nodes are of the form (MBR, Child_Pointer). Let M be the maximum number of entries possible in a node, let m = M/2 and let d be the number of dimensions. The design of the R*-Tree introduces a policy called forced reinsert: if a node overflows, it is not split immediately; instead the first p entries of the node are reinserted into the tree, with the parameter p varying. The R*-Tree has the following properties:
- The root has at least two children, unless it is a leaf.
- Every non-leaf node has between m and M children unless it is the root.
- Every leaf node contains between m and M entries unless it is the root.
- For each entry in a non-leaf node, the MBR is the smallest rectangle that spatially contains the rectangles in the subtree pointed to by the child node.
- All leaves appear at the same level.

The criteria used for optimization are:
- Minimizing the area covered: minimizing the dead space improves performance, since decisions on which paths to traverse can be taken at higher levels.
- Minimizing MBR overlap: this decreases the number of paths traversed.
- Minimizing the margin of the MBR: the margin is the perimeter of the rectangle. By minimizing the margin, the rectangles become more quadratic, and queries with large quadratic query rectangles profit from this optimization.
- Optimizing storage utilization: higher storage utilization reduces query cost, as the height of the tree is kept low.
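The three geometric quantities involved here, area, margin (perimeter) and overlap, are simple to compute for 2-D MBRs. A minimal sketch, with the tuple representation (xmin, ymin, xmax, ymax) being an assumption for illustration:

```python
def area(r):
    """Area of an MBR given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def margin(r):
    """Margin = perimeter of a 2-D MBR."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))

def overlap(a, b):
    """Area of the intersection of two MBRs (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0
```

For a fixed area, the margin is minimized by a square, which is why the margin criterion pushes the covering rectangles toward quadratic shapes.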

4.1.3.3 Operations

Search: Given a query region, searching involves retrieving all data objects that overlap with the query region. At each level of the search, the nodes whose MBRs overlap with the given query region are selected and traversed, and when leaf nodes are reached the entries whose MBRs overlap are retrieved.

Search(N,Q)
Input: R*-Tree rooted at N and query region Q given by the user.
Output: All objects in the query region Q.
Step 1: If N is a Leaf node, for each entry E whose E.MBR intersects the query region Q, retrieve the object pointed to by E.Object_Pointer.
Step 2: If N is a Non-Leaf node, for each entry E whose E.MBR intersects Q, Search(E.Child_Pointer, Q) is invoked.

Insertion: The first step in insertion is to select a leaf node where the new object can be inserted. The R*-Tree uses a combination of the parameters area, margin and overlap to select a suitable entry to traverse in an intermediate node. If the intermediate node points to leaf nodes, the entry with the least overlap enlargement is selected; ties are resolved by selecting the entry requiring the least area enlargement. For other intermediate nodes, the entry requiring the least area enlargement to include the new object is selected. The main idea is that non-leaf nodes select an entry that needs the least area enlargement to include the new data, while parents of leaf nodes select the entry that needs the least overlap enlargement. In case of overflows, some entries are first reinserted; this can often avoid splitting. If E1, ..., Ep are the entries in a node, the overlap value of an entry Ek is given by

Overlap(Ek) = Σ (i=1..p, i≠k) Area(Ek.MBR ∩ Ei.MBR),  1 ≤ k ≤ p

Insert(N,O)
Input: R*-Tree rooted at N and object O to be inserted.
Output: New R*-Tree with O inserted.
Step 1: If N is a parent of leaf nodes, the entry E whose rectangle needs the least overlap enlargement to include O is selected. Ties are resolved by choosing the entry which needs the least area enlargement, and then the one having the least area. Insert(E.Child_Pointer, O) is invoked.
Step 2: If N is any other Non-Leaf node, the entry E which needs the least area enlargement to include O is selected. Ties are resolved by choosing the entry with the least area. Insert(E.Child_Pointer, O) is invoked.
Step 3: If N is a Leaf node, O is accommodated in N.
Step 4: If N overflows: if N is not the root and Step 4 executes for the first time for the given object, ReInsert(N) is invoked, else Split(N) is invoked.
Step 5: If P is the parent node of N with entry E pointing to N, E.MBR is adjusted to contain all objects in N.
Step 6: If there was a split resulting in a new node N', a new entry E' with E'.MBR containing the objects in N' and E'.Child_Pointer pointing to N' is created and inserted into P.
Step 7: If P overflows, let N = P and Step 4 is followed.
Step 8: If the root gets split, a new root is created with the entries E and E' pointing to N and N'.
Step 9: The MBRs of all the entries along the path are adjusted to include all their children nodes.
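The overlap-enlargement rule for parents of leaf nodes can be condensed into a short sketch. This is illustrative only: 2-D tuple MBRs and the function names are assumptions, and the secondary area-based tie-breaking is omitted for brevity.

```python
# Sketch of the R*-Tree ChooseSubtree rule for parents of leaf nodes,
# assuming MBRs are (xmin, ymin, xmax, ymax) tuples.

def enlarge(r, o):
    """Smallest MBR containing both r and o."""
    return (min(r[0], o[0]), min(r[1], o[1]),
            max(r[2], o[2]), max(r[3], o[3]))

def overlap_area(a, b):
    """Area of the intersection of two MBRs (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def overlap_value(k, entries):
    """Overlap(E_k): sum of intersection areas with all other entries."""
    return sum(overlap_area(entries[k], entries[i])
               for i in range(len(entries)) if i != k)

def choose_leaf_entry(entries, obj):
    """Index of the entry needing the least overlap enlargement to
    include obj (ties would further be broken by area enlargement)."""
    def cost(k):
        grown = entries[:k] + [enlarge(entries[k], obj)] + entries[k + 1:]
        return overlap_value(k, grown) - overlap_value(k, entries)
    return min(range(len(entries)), key=cost)
```

For example, an object already contained in one entry's MBR yields zero overlap enlargement there, so that entry is preferred over one that would have to grow into its neighbors.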
Node-Split: Goodness values, namely the area-value, margin-value and overlap-value, are used. The area-value is the total area of the MBRs of the two groups, the margin-value is the sum of the perimeters of the MBRs of both groups, and the overlap-value is the area of the intersection of the two groups. For any two groups G1 and G2:

Margin-Value = Margin[MBR(G1)] + Margin[MBR(G2)]
Area-Value = Area[MBR(G1)] + Area[MBR(G2)]
Overlap-Value = Area[MBR(G1) ∩ MBR(G2)]

The entries are sorted along each axis. For each sort, (M−2m+2) distributions of the M+1 entries are possible, where in the k-th distribution the first group contains the first (m−1+k) entries and the second group contains the remaining entries. The axis which has the least sum of margin-values is chosen as the split axis, and the best distribution along that axis is selected. For each axis the entries have to be sorted, which requires O(M log M) time. For each axis the margins of 4·(M−2m+2) rectangles and the overlaps of 2·(M−2m+2) distributions have to be calculated.

Split(N)
Input: Node N of the R*-Tree to be split.
Output: New re-organized R*-Tree.
Step 1: Along each axis, the entries are sorted first by the lower values and then by the upper values of their rectangles.
Step 2: For each sort, all of the (M−2m+2) distributions are determined.
Step 3: For each axis, the sum of the margin-values of its distributions is determined.
Step 4: The axis with the least sum is chosen as the split axis.
Step 5: Along the selected axis, the distribution having the minimum overlap-value is selected. In case of ties, the distribution with the minimum area-value is selected.
Step 6: The entries are distributed into the two groups of the chosen distribution.

Forced Re-insert: The R*-Tree uses forced reinsertion for dynamic reorganization during insertion. If a node overflow occurs, a defined percentage of the entries with the highest distances from the center of the region are deleted from the node and inserted again. By this means the storage utilization improves, and the quality of the partitioning improves because unfavorable decisions made at the beginning of index construction can be corrected. If all the entries land in the same node during forced reinsertion, the node is split. Following are the advantages of using forced reinsertion: it exchanges entries between neighboring nodes and thus decreases the overlap.


As a side effect, storage utilization is improved. Due to the restructuring, fewer splits occur. Since the outer rectangles of a node are reinserted, the shape of the MBRs becomes more quadratic, which is a desirable property.

The CPU cost increases, as the insertion routine is called more often, but there are fewer splits. Average disk accesses increase slightly when forced reinsertion is applied, but it improves the structure.

ReInsert(N)
Input: Node N which overflows during insertion and the number of entries p to be reinserted.
Output: Re-organized R*-Tree without a split of N.
Step 1: For all the M+1 entries in N, the distance between the center of their MBR and the center of the MBR of all the M+1 entries is computed.
Step 2: The entries are sorted in decreasing order of their computed distances.
Step 3: The first p entries are removed from N and its MBR is adjusted.
Step 4: For each removed entry E, Insert(Root, E) is invoked, where Root is the root of the R*-Tree.

Deletion: The deletion routine of the R*-Tree is the same as that of the R-Tree. Using the search algorithm, the leaf node containing the entry to be deleted is determined, and the entry is removed from the leaf node. The MBRs of the entries along the path are updated. If the deletion causes an underflow in a leaf node, its entries are removed, the node is deleted, the tree is updated and the removed entries are inserted again using the insertion routine.

Delete(N,O)
Input: R*-Tree rooted at N and object O to be deleted.
Output: New R*-Tree with O removed.
Step 1: Search(N, O.MBR) is invoked to get the leaf node L which contains O. If L is not found, the process terminates.
Step 2: O is removed from L. Let N = L and let Q be the set of entries from under-filled nodes.
Step 3: If N is the root node, Step 7 is followed. Else let P be the parent of N and E its entry in P.
Step 4: If N has fewer than m entries, remove E from P and add the entries of N to Q.
Step 5: If N has at least m entries, E.MBR is adjusted in P.


Step 6: Let N = P and the process is repeated from Step 3.
Step 7: For each entry E in Q, Insert(N, E) is invoked.

R*-Trees are based on the reduction of the area, margin and overlap of the directory rectangles and effectively support point and spatial data at the same time. The implementation cost is only slightly higher than that of other R-Trees. The structure is robust against skewed data distributions, and the average insertion cost is lower than that of the R-Tree. It differs from the R-Tree mainly in the insertion and node-split algorithms; the algorithms for deletion and search remain the same. In the future, the fan-out can be reduced and R*-Trees can be generalized to handle polygons efficiently.
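The axis- and distribution-selection procedure of the R*-Tree split, described earlier in this section, can be condensed into the following 2-D sketch. It is not the authors' code: the tuple MBRs and names are assumptions, and it returns only the chosen axis and split position rather than reorganizing a tree.

```python
def rstar_split(rects, m):
    """R*-Tree split sketch: for each axis sort by the low edge,
    enumerate the (M - 2m + 2) distributions, pick the axis with the
    smallest sum of margin-values, then the distribution with minimum
    overlap-value on that axis. Returns (axis, split_index)."""
    M = len(rects) - 1          # node holds M + 1 entries on overflow

    def mbr(group):
        return (min(r[0] for r in group), min(r[1] for r in group),
                max(r[2] for r in group), max(r[3] for r in group))

    def margin(r):
        return 2 * ((r[2] - r[0]) + (r[3] - r[1]))

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    def ovl(a, b):
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        return w * h if w > 0 and h > 0 else 0.0

    best = None
    for axis in (0, 1):
        srt = sorted(rects, key=lambda r: r[axis])
        # k-th distribution: first group holds the first (m - 1 + k) entries
        dists = [(srt[:m - 1 + k], srt[m - 1 + k:])
                 for k in range(1, M - 2 * m + 3)]
        margin_sum = sum(margin(mbr(g1)) + margin(mbr(g2))
                         for g1, g2 in dists)
        # best distribution on this axis: min overlap, ties by area
        k_best = min(range(len(dists)), key=lambda i: (
            ovl(mbr(dists[i][0]), mbr(dists[i][1])),
            area(mbr(dists[i][0])) + area(mbr(dists[i][1]))))
        if best is None or margin_sum < best[0]:
            best = (margin_sum, axis, m - 1 + (k_best + 1))
    return best[1], best[2]
```

For four squares arranged in two horizontal rows, the y-axis yields the smaller margin sum, so the split separates the rows.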

4.1.4 SS-Tree
Similarity indexing is required in many applications to facilitate efficient similarity queries over datasets of typically high-dimensional feature vectors. For example, a query on a content-based image database could find pictures with similar color or texture. Similarity indexing has three main components: objects represented by high-dimensional feature vectors, querying feature vectors based on one or more (dis)similarity measures, and different types of fundamental queries. Usually the knowledge of a domain expert is required to represent data objects as feature vectors. Query performance is more important than update performance in image and multimedia databases, but dynamic updating of the database should still be supported. The SS-Tree [Jain96] is an improvement of the R*-Tree [Krei90] and has a configuration similar to that of the R-Tree [Gutt84]. To avoid exhaustive searches and save space, all the elements of the feature vector should be used for indexing. In order to use the SS-Tree, the following must be considered: feature vectors have to be given in a format such that the Euclidean distance metric on those features can approximate the desired measure of dissimilarity, and knowledge of the domain should be used to constrain the similarity measures.

4.1.4.1 SS-Tree Vs R*-Tree

The SS-Tree uses Minimum Bounding Spheres (MBS) instead of minimum bounding rectangles in its non-leaf and leaf nodes; the leaf nodes contain the feature vector along with the data. The storage per region in d dimensions is reduced, as a sphere requires only (d+1) memory units whereas a rectangle requires 2d memory units. Due to its smaller storage space the SS-Tree can have a higher fan-out than the R-Tree and its variants, resulting in a tree of lower height. The longest distance between any point and the center of a sphere is its radius, a single constant, as compared to rectangles, whose extents vary with any change in the number of dimensions.

In high-dimensional data spaces, spheres are expected to generate better data groupings, which contributes to data retrieval performance. The SS-Tree reinserts entries unless a reinsertion has already been made at the same node or leaf, whereas in the R*-Tree reinsertion takes place unless a reinsertion has already been made at the same level. This promotes the dynamic reorganization of the structure.

4.1.4.2 Salient Features

The leaf nodes of the SS-Tree contain entries of the form (Feature_Vector, Data), where Data holds the data for the leaf. Non-leaf nodes contain entries of the form (Centroid, Radius, Child_Pointer). The centroid is the mean value of the feature vectors in the child node and the radius is the distance from the centroid to the outermost feature vector. A d-dimensional feature space requires d+1 memory units to store an entry. Points are divided into isotropic neighborhoods bounded by spheres covering regions of short diameter. For each entry in a non-leaf node, the bounding sphere contains the entries in the subtree pointed to by the child node. Each node, except the root node, has a minimum of m and a maximum of M entries.

Figure 4a shows a 2-dimensional representation of the high-dimensional feature space. The objects, which consist of both point and spatial data, are bounded by Minimum Bounding Spheres (circles in the 2-dimensional case). The MBSs S1, S2 and S3 cover the entire data space. The MBSs S4 to S10 in the intermediate nodes cover the data objects in the data space. It can be observed from the figure that multiple paths may have to be traversed during search operations. For example, the exact match query for the object f has S2 → S7 → f and S1 as search paths.

Figure 4a: SS-Tree


Figure 4b: SS-Tree Structure

4.1.4.3 Operations

The algorithms for search, insertion and deletion of the R-Tree and R*-Tree can be used in the SS-Tree with some modifications in the routines used for insertion and node-splitting.

Search: The search algorithm examines regions in order of minimum distance from the query point until the query results are guaranteed correct to the required accuracy. Two priority queues are used: a search queue and a result queue. For similarity sampling, the same algorithm proposed for the R-Tree can be used for the SS-Tree; it provides fast sampling using only the internal nodes. The search algorithm traverses from the root to the leaf nodes. At each level, the entries whose bounding spheres overlap with the query region are selected for traversal. At the leaf nodes, the objects which overlap with the query region are retrieved.

Search(N,Q)
Input: SS-Tree rooted at N and query region Q given by the user.
Output: All objects in the query region Q.
Step 1: If N is a Leaf node, for each entry E whose bounding sphere intersects Q, E.Data is retrieved.
Step 2: If N is a Non-Leaf node, for each entry E whose bounding sphere intersects Q, Search(E.Child_Pointer, Q) is invoked.

Insertion: The insertion algorithm is similar to that of the R*-Tree in that it uses the concept of forced reinsertion. Each node has a minimum of m and a maximum of M entries. If the tree initially has only the root node, a new node array is created and the object is inserted; otherwise the tree is traversed to select a leaf node where the object is to be inserted. At each level of the tree, the entry whose centroid is closest to the feature vector of the object to be inserted is selected. Every entry traversed for insertion is updated by adjusting the values of its centroid and radius. Once the leaf node is reached, the object is added to it. If there is an overflow and the node's entries have not already been reinserted, they are reinserted. If reinsertion cannot be applied during the overflow of a node, it is split. For reinsertion, the entries are sorted in descending order of their distances from the centroid of the bounding sphere of the node. The first p entries of the list are selected for reinsertion; these p entries are removed from the node and the bounding sphere of the node is adjusted. The entries are then inserted one by one into the tree.

Insert(N,O)
Input: SS-Tree rooted at N and object O to be inserted.
Output: New SS-Tree with O inserted.
Step 1: If the tree has only the root node, create a new node array and insert the new entry.
Step 2: If N is a Non-Leaf node, the entry E whose E.Centroid is closest to O.Feature_Vector is selected. E.Centroid and E.Radius are updated and Insert(E, O) is invoked.
Step 3: If N is a Leaf node, O is added to N.
Step 4: If N overflows and its entries have not already been reinserted, ReInsert(N) is invoked.
Step 5: If N overflows and reinsertion cannot be applied, Split(N) is invoked.

ReInsert(E)
Input: Root node N of the SS-Tree, node E whose entries have to be reinserted, and the number p of entries to be reinserted.
Output: Reorganized SS-Tree.
Step 1: The entries in E are sorted based on their distance from the centroid of the bounding sphere of E.
Step 2: The first p entries are selected.
Step 3: These p entries are removed from E and the tree is updated.
Step 4: For each removed entry, Insert(N, entry) is invoked.
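Two small pieces of this insertion logic, choosing the child with the nearest centroid and maintaining a centroid as a running mean, can be sketched as follows. The entry layout (centroid, radius, weight) and the function names are assumptions for illustration.

```python
def closest_entry(entries, vec):
    """SS-Tree descent rule: index of the child whose centroid is
    nearest to the feature vector being inserted.
    entries = [(centroid, radius, weight), ...]."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(entries)), key=lambda i: dist2(entries[i][0]))

def update_centroid(centroid, weight, vec):
    """Running mean of the feature vectors after adding one vector;
    `weight` counts the vectors already represented by the centroid."""
    new_w = weight + 1
    new_c = tuple((c * weight + v) / new_w for c, v in zip(centroid, vec))
    return new_c, new_w
```

Updating every centroid along the descent path keeps the spheres centered on their contents without revisiting the stored vectors.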


Node-Split: During insertion, when there is an overflow and reinsertion cannot be applied (i.e. all the reinserted entries land in the same node), the node has to be split. The split algorithm first determines the variance of the entries in each dimension and selects the dimension with the highest variance as the split dimension. Along this dimension, a split location that minimizes the sum of the variances on both sides of the split is selected. The entries on the two sides of the split location are assigned to two new nodes. If the root gets split, a new root array is allocated and the two new nodes are entered in it; otherwise, of the two new nodes, the one which is closest to the parent is retained in the parent node and the other node is reinserted.

Split(N)
Input: Node N whose entries are to be split.
Output: New re-organized SS-Tree.
Step 1: For the entries in N, the dimension with the highest variance is selected as the split dimension.
Step 2: The split location is selected so as to minimize the sum of the variances on each side of the split.
Step 3: Two new nodes, N1 and N2, are created and the split entries are assigned to them.
Step 4: If N is the root node, a new root array is allocated and N1 and N2 are written to it.
Step 5: If N is not the root node and P is its parent node, then of the two new nodes N1 and N2 the one which is closest to P is retained in P and ReInsert() is invoked for the other node.

Deletion: The tree is traversed from the root until the leaf node is reached. At each intermediate node, the entry which is closest to the object to be deleted is selected for traversal. In the leaf node, the entry is removed and the node is adjusted. In case of underflow due to the deletion, the entries in the node are removed, the tree is updated and the entries are inserted one by one into the tree.

Delete(N,O)
Input: SS-Tree rooted at N and object O to be deleted.
Output: New SS-Tree with O deleted.
Step 1: If N is a Non-Leaf node, the entry E whose E.Centroid is closest to O.Feature_Vector is selected and Delete(E, O) is invoked.
Step 2: If N is a Leaf node, O is removed from N and the tree is updated.


Step 3: If there is an underflow in N, all of the remaining entries are removed, the tree is updated and ReInsert(N) is invoked.

The SS-Tree is an indexing structure created mainly for similarity indexing. Less information is needed to store bounding spheres, which results in a larger fan-out. The diameters of the bounding spheres are insensitive to dimensionality, which improves query performance. However, bounding spheres occupy more volume than bounding rectangles, and as the dimensionality increases, the greater overlap between the bounding spheres affects query performance. The SS-Tree is better suited for approximate queries than the R*-Tree. For higher-dimensional data, the SS-Tree provides faster query performance than the R*-Tree. It requires significantly less CPU time to insert elements, because it uses a linear algorithm as compared to the others, and its storage utilization is greater than that of the R*-Tree.
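The variance-based split used by the SS-Tree can be sketched for point entries as follows; the function names and the min_fill parameter are illustrative assumptions.

```python
def variance_split(points, min_fill=1):
    """SS-Tree style split sketch: choose the dimension with the
    highest variance, then the split position minimizing the summed
    variance of the two sides. Returns (dimension, split_index) over
    the values sorted along the chosen dimension."""
    n, d = len(points), len(points[0])

    def var(vals):
        if not vals:
            return 0.0
        mu = sum(vals) / len(vals)
        return sum((v - mu) ** 2 for v in vals) / len(vals)

    # split dimension = dimension of maximum variance
    dim = max(range(d), key=lambda j: var([p[j] for p in points]))
    vals = sorted(p[dim] for p in points)
    # split location minimizing the sum of variances of the two sides
    idx = min(range(min_fill, n - min_fill + 1),
              key=lambda i: var(vals[:i]) + var(vals[i:]))
    return dim, idx
```

For two tight clusters separated along one dimension, the minimum falls exactly between the clusters.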

4.1.5 SR-Tree
A feature vector is extracted from image characteristics like hue, saturation, intensity and texture, and stored in the database along with the images. A set of images similar to a particular image can be retrieved by searching for feature vectors close to that of the given image. The SS-Tree was proposed for similarity queries based on such feature vectors and performs better than the R*-Tree and the K-D-B-Tree. The Sphere/Rectangle-Tree, or SR-Tree [Kata97], can be regarded as a combination of the R*-Tree and SS-Tree and outperforms both. It uses the intersection solid between a rectangle and a sphere as the bounding region. Bounding rectangles and spheres both have their own merits and demerits. Bounding rectangles divide points into small-volume regions, but have larger diameters. Bounding spheres divide points into short-diameter regions, but have larger volumes than bounding rectangles. Spheres are better suited for processing nearest neighbor and range queries, but are difficult to maintain and tend to produce much overlap on splitting.

4.1.5.1 SR-Tree Vs SS-Tree and R*-Tree

The SR-Tree uses both minimum bounding spheres and minimum bounding rectangles in its nodes and leaves. It outperforms the SS-Tree and R*-Tree by reducing the volume and diameter of the regions, which improves the disjointness among regions and enhances the performance of nearest neighbor search. The fan-out of the SR-Tree is about one-third of that of the SS-Tree and two-thirds of that of the R*-Tree. Even though it suffers from this fan-out problem, it requires fewer disk reads than the SS-Tree.


4.1.5.2 Salient Features

The SR-Tree employs both bounding spheres and bounding rectangles, and specifies a region by the intersection of the bounding sphere and the bounding rectangle. Leaf nodes contain entries of the form (Feature_Vector, Data), where Data holds the data for the leaf. Nodes contain between m and M entries, where m <= M/2. Non-leaf nodes contain entries of the form (Centroid, Radius, MBR, n, Child_Pointer), where MBR is the minimum bounding rectangle and n is the total number of data entries stored in the subtree pointed to by Child_Pointer.
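Membership in an SR-Tree region reduces to two tests, one against the sphere and one against the rectangle. A minimal sketch, where storing the MBR as the low coordinates followed by the high coordinates is an assumption for illustration:

```python
def in_sr_region(point, centroid, radius, mbr):
    """A point lies in an SR-Tree region iff it is inside BOTH the
    bounding sphere and the bounding rectangle (their intersection).
    mbr = (lo_1, ..., lo_d, hi_1, ..., hi_d)."""
    d = len(point)
    dist2 = sum((p - c) ** 2 for p, c in zip(point, centroid))
    in_sphere = dist2 <= radius ** 2
    in_rect = all(lo <= p <= hi
                  for p, lo, hi in zip(point, mbr[:d], mbr[d:]))
    return in_sphere and in_rect
```

A corner of the rectangle that lies outside the sphere is excluded from the region, which is exactly how the intersection trims the dead space left by either shape alone.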

Figure 5a: SR-Tree

Figure 5b: SR-Tree Structure


The SR-Tree structure for the running example is shown in Figure 5. Its structure is based on that of the R-Tree [Gutt84]. In the figure, the boxes indicate the bounding rectangles of the objects enclosed by them and the circles represent the bounding spheres. The intersection of these two regions forms the bounding region of the SR-Tree. Multiple paths may need to be traversed during search operations. For example, the exact match query for object f involves searching the paths S2 → S7 → f and S1.

4.1.5.3 Operations

The algorithms for search, insertion, deletion and node-split are derived from the corresponding algorithms of the SS-Tree, R-Tree and R*-Tree. The modifications mainly concern the updates of the bounding spheres and bounding rectangles during insertions and deletions.

Search: The search algorithm examines regions in order of minimum distance from the query point until the query results are guaranteed correct to the required accuracy. Two priority queues are used: a search queue and a result queue. The search algorithm traverses from the root to the leaf nodes. At each intermediate node, the entries whose regions overlap with the query region are traversed. At the leaf nodes, the entries whose Feature_Vector overlaps with the query region are retrieved.

Search(N,Q)
Input: SR-Tree rooted at N and query region Q given by the user.
Output: All objects in the query region Q.
Step 1: If N is a Non-Leaf node, for each entry E whose region overlaps with Q, Search(E, Q) is invoked.
Step 2: If N is a Leaf node, all the entries whose Feature_Vector overlaps with Q are retrieved.

Insertion: Due to its effectiveness for nearest neighbor search, the SR-Tree uses the centroid-based algorithm of the SS-Tree for insertion. To insert a new entry, the algorithm descends the tree from the root to a leaf node. At each intermediate level, the entry whose centroid is closest to the object's Feature_Vector is selected for traversal. Every entry traversed is updated by adjusting its centroid and radius. After insertion, if the node overflows, its entries are first reinserted; if reinsertion results in the entries being assigned to the same node, the node is split. Updating the minimum bounding rectangle is done as in the R-Tree. Let d be the dimension of the data space, Ek the k-th entry in the child node, Ek.w the total number of objects in the subtree pointed to by Ek, and n the number of entries in the node.


The center of the bounding sphere is the weighted centroid of the children:

xi = ( Σk=1..n Ek.xi · Ek.w ) / ( Σk=1..n Ek.w )

Let ds be the maximum distance from the center of a parent node to the bounding spheres of its children, and dr the maximum distance from the center of a parent node to the bounding rectangles of its children:

ds = max1≤k≤n [ ||x − Ek.x|| + Ek.r ]
dr = max1≤k≤n [ MAXDIST(x, Ek.MBR) ]

where MAXDIST() computes the maximum distance from a point to a minimum bounding rectangle. The radius of the bounding sphere is then r = min(ds, dr).

Insert(N,O)
Input: SR-Tree rooted at N and the object to be inserted O.
Output: New SR-Tree with O inserted.
Step 1: If N is a non-leaf node, the entry E whose E.Centroid is closest to O.Feature_Vector is selected and Insert(E,O) is invoked.
Step 2: If N is a leaf node, O is inserted.
Step 3: If N overflows, Reinsert(N) is invoked if reinsertion has not already been tried; otherwise Split(N) is invoked.
Step 4: The SR-Tree is adjusted and the minimum bounding spheres and rectangles are updated along the path.

Reinsert(N)
Input: Node N whose entries are to be reinserted, and p, the number of entries to be reinserted.
Output: Reorganized SR-Tree.
Step 1: The entries in N are sorted by their distance from the centroid of the bounding sphere of N.


Step 2: The first p entries are selected.
Step 3: These p entries are removed from N and the tree is updated.
Step 4: For each removed entry E, Insert(Root,E) is invoked from the root of the tree.

Node-Split: The split algorithm is similar to that of the R*-Tree. It calculates the variances of the entries in the node along each dimension, selects the dimension with the highest variance, and then chooses the location of the split. Two new nodes are created and the entries are assigned to them. If the node is not the root, the new node which is closer to the parent is retained and the other node is reinserted into the tree. The bounding boxes are adjusted along the path in the tree.
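A minimal sketch of this variance-based choice of split dimension and location, assuming the entry centroids are available as coordinate tuples (the fill-factor handling is simplified to a minimum group size):

```python
def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def choose_split(centroids, min_fill=1):
    # Pick the dimension with the highest variance of the centroids.
    dims = range(len(centroids[0]))
    split_dim = max(dims, key=lambda d: variance([c[d] for c in centroids]))
    # Pick the split location minimizing the sum of the two side variances.
    order = sorted(centroids, key=lambda c: c[split_dim])
    best_i, best_cost = None, None
    for i in range(min_fill, len(order) - min_fill + 1):
        cost = (variance([c[split_dim] for c in order[:i]]) +
                variance([c[split_dim] for c in order[i:]]))
        if best_cost is None or cost < best_cost:
            best_i, best_cost = i, cost
    return split_dim, order[:best_i], order[best_i:]
```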

Split(N)
Input: Node N whose entries are to be split.
Output: New reorganized SR-Tree.
Step 1: The variances of the coordinates along each dimension are calculated from the centroids of its children.
Step 2: The dimension with the highest variance is selected as the split dimension.
Step 3: The split location is selected so as to minimize the sum of the variances on each side of the split.
Step 4: Two new parent nodes, E1 and E2, are created and the split entries are assigned to them.
Step 5: If N is the root node, a new root is allocated and E1 and E2 are written to it.
Step 6: If N is not the root node and P is its parent, the one of E1 and E2 which is closer to P is retained in P and Reinsert() is invoked for the other node.
Step 7: The MBBs are adjusted along the path.

Deletion: The deletion algorithm is shared with the R*-Tree and SS-Tree. Given an entry to be deleted, the tree is traversed to reach the leaf node holding the object. At each level, the entry whose centroid is closest to the object to be deleted is selected. If the deletion does not cause an underflow, the object is simply removed from the tree; otherwise the node is removed from the tree, the entries are updated and the orphaned entries are reinserted into the tree.


Delete(N,O)
Input: Root node N and the object to be deleted O.
Output: New SR-Tree with O deleted.
Step 1: If N is a non-leaf node, the entry E whose E.Centroid is closest to O.Feature_Vector is selected and Delete(E,O) is invoked.
Step 2: If N is a leaf node, O is removed from N and the tree is updated.
Step 3: If N underflows, all of its remaining entries are removed, the tree is updated and Reinsert(N) is invoked.

The SR-Tree is one of the latest structures to combine minimum bounding rectangles and minimum bounding spheres. By combining both, it inherits the advantages of the SS-Tree and the R*-Tree. Introducing bounding rectangles lets a neighborhood be partitioned into smaller regions and improves the disjointness among regions; it reduces the volume and diameter of regions compared to the SS-Tree and the R*-Tree. The creation cost of the SR-Tree is higher than that of the SS-Tree and its fanout is small: a node entry is three times larger than that of the SS-Tree and one and a half times that of the R*-Tree. Since the SR-Tree saves more leaf-level reads than it adds node-level reads, its total disk reads are fewer than those of the SS-Tree. The SR-Tree is effective from low to high dimensionality, improves on the SS-Tree, and is particularly effective for less uniform data sets, which are common in practical image/video similarity search. Although its creation cost is higher, the SR-Tree outperforms the SS-Tree for applications requiring index structures efficient for high-dimensional nearest-neighbor queries.
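The bounding-sphere maintenance defined by the formulas above can be sketched as follows. This is an illustrative sketch, assuming each child entry is a dict with 'center', 'radius', 'weight' (the object count Ek.w) and 'mbr' (per-dimension (lo, hi) intervals):

```python
import math

def maxdist(point, mbr):
    # Maximum distance from a point to a minimum bounding rectangle:
    # in each dimension take the farther of the two interval endpoints.
    return math.sqrt(sum(max(abs(p - lo), abs(p - hi)) ** 2
                         for p, (lo, hi) in zip(point, mbr)))

def bounding_sphere(children):
    total_w = sum(c['weight'] for c in children)
    dim = len(children[0]['center'])
    # Weighted centroid of the children (the formula for xi above).
    x = [sum(c['center'][i] * c['weight'] for c in children) / total_w
         for i in range(dim)]
    # ds: farthest reach of any child's bounding sphere from x.
    ds = max(math.dist(x, c['center']) + c['radius'] for c in children)
    # dr: farthest reach of any child's bounding rectangle from x.
    dr = max(maxdist(x, c['mbr']) for c in children)
    return x, min(ds, dr)
```

Taking r = min(ds, dr) is what lets the rectangles tighten the sphere radius relative to a plain SS-Tree.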

4.1.6 TV-TREE
Structures like the R-Tree [Gutt84] and its variants [Sell87, Krei90] and Grid Files [Sevc94], when extended to higher dimensions, grow exponentially in space and time. The Telescopic-Vector Tree, or TV-Tree [Jaga94], avoids this dimensionality problem by using a varying number of dimensions for indexing, depending on the objects and their levels; that is, feature vectors can expand and contract dynamically. It organizes data in a hierarchical structure, with the Minimum Bounding Box (MBB) of the enclosed objects stored in the parent node. Nodes near the root have a high fanout, using only the basic features. As more and more objects are added, new features are introduced at lower levels to discriminate between the objects. The Karhunen-Loeve Transform [Kein90] is used to order the features by importance: it sorts the features of the given vectors based on their discriminatory power. If either the data or its statistical properties are known in advance, a transform with efficient performance can be obtained. Depending on the application, the shape of the MBB can be chosen among sphere, rectangle, cube, diamond, etc., with the sphere being the simplest.


4.1.6.1 TV-Tree Vs R-Tree and its variants

Although the R-Tree and its variants can conceptually be extended to higher dimensions, they usually require time and space that grow exponentially with the dimension and degenerate to sequential scanning. Insertion is cheaper in the TV-Tree because it is shallower than the corresponding R*-Tree. The number of disk accesses for a search in the TV-Tree is lower than in the R*-Tree, and the savings in total disk accesses grow with the size of the database, indicating that it scales well. As object size increases, the leaf fanout decreases, making the TV-Tree grow faster, but this does not affect search performance much. The TV-Tree requires fewer nodes and hence less storage space. The space savings come from the internal nodes, which means that the non-leaf levels require only a small buffer, which can be significant when buffer space is limited. The TV-Tree can thus be used for high-dimensional feature spaces without the dimensionality problem. The feature vectors contract and expand dynamically, and compared to trees that use a fixed number of features, the TV-Tree provides a higher fanout at the top levels using only a few features.

4.1.6.2 Salient Features

Each MBB is represented by its center and radius. The MBBs can overlap, and all the entries of a node's subtree are contained in the node. More than one level can have the same number of active dimensions. Each entry of an intermediate node is of the form (center, radius), where 'center' is the telescopic vector of the center of the MBB, containing the active dimensions of the feature vectors of the enclosed objects. A telescopic vector has the form (f_list[], n), where f_list is the array of feature values for the active dimensions and n is the number of active dimensions. At the level where a dimension turns from active to inactive, its coordinates are stored. The number of active dimensions is constant within a level.
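The telescopic-vector idea can be sketched as a tiny class. This is an illustration only (the radius component of a TV-Tree entry is omitted for brevity, and the method names are invented, not from the paper):

```python
class TelescopicVector:
    # Only the leading (active) dimensions of a feature vector are stored;
    # the vector can extend when more discrimination is needed.
    def __init__(self, f_list, n):
        self.f_list = list(f_list)[:n]   # feature values of active dimensions
        self.n = n                       # number of active dimensions

    def extend(self, extra):
        # Activate further dimensions (telescoping out).
        self.f_list.extend(extra)
        self.n += len(extra)

    def contains(self, full_vector):
        # An entry can describe an object if it agrees on all active dims.
        return list(full_vector[:self.n]) == self.f_list
```

With few active dimensions many objects match, giving the high fanout near the root; extending the vector narrows the match, which is how lower levels discriminate.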

The structure of the TV-Tree for the running example is given in Figure 6a. The objects, labeled with single letters, are stored in the leaf nodes. These objects are bounded by the minimum bounding spheres S1 to S7. The root node has the entries S8 and S9, bounding the MBSs S1 to S7.


Figure 6a: TV-Tree

Figure 6b: TV-Tree Structure

4.1.6.3 Operations

Search: Given a query region, the algorithm retrieves all the points in the region. It starts at the root and recursively traverses the branches that intersect the query region. Because of overlaps, multiple paths may have to be traversed. For nearest neighbor queries, given a query point, the top-level branches are examined, their lower and upper distance bounds are computed, the most promising branch is selected for descent, and branches that are too far away are discarded.

Search(N,Q)
Input: TV-Tree rooted at N and the query region Q.
Output: Points in the query region.
Step 1: If N is a leaf node, check all the entries and return those whose MBR overlaps with the query region.

Step 2: If N is not a leaf node, for each entry whose MBR overlaps with the query region, repeat the search steps.

Insertion: To insert a new object, the tree is traversed, at each step selecting the entry most suitable to contain the new object. After insertion, if the leaf node overflows, either some entries are reinserted or the node is split. The MBRs of the nodes along the path are updated. During insertion, contraction can occur, resulting in an MBR of lower dimensionality. The split algorithm divides the entries into two groups with at least ff percent space utilization, where ff is a performance parameter. Splitting can be done either by clustering or by ordering the entries in a node. To select an entry in an intermediate node, the following criteria are used in decreasing order of priority:
1. Minimum increase in overlapping regions within the node: select an entry E such that after updating E.MBR, the number of overlaps among the entries within the node is minimized.
2. Minimum decrease in dimensionality: select the entry E that can accept the new object by contracting its center as little as possible.
3. Minimum increase in radius.
4. Minimum distance from the center of the MBR to the object.

Insert(N,O)
Input: TV-Tree rooted at N and object to be inserted O.
Output: Reorganized TV-Tree.
Step 1: If N is a non-leaf node, the entry E needing the least overlap enlargement to include O is selected. Ties are broken by minimum decrease in dimensionality, minimum increase in radius and minimum distance from the center, in that order.
Step 2: Let N = E. N is subjected to Step 1 recursively until N is a leaf node.
Step 3: O is inserted into N.
Step 4: If N overflows and this is the first overflow during the insertion, the entries in N are sorted in descending order of their distances from the center, the first p entries are removed, and Insert(Root,E) is invoked for each removed entry E, where Root is the root node of the TV-Tree. Otherwise Split(N,ff) is invoked to split N into two leaves N1 and N2, and the entry for N in its parent node is replaced by entries for N1 and N2.


Step 5: The MBRs that have changed are updated, and an intermediate node is split if it overflows.

Split: Splitting redistributes a set of MBRs into two groups so as to facilitate operations and provide high space utilization. It can be done in two ways: by ordering or by clustering.

Splitting by ordering: The vectors are ordered and the best partition along the ordering is found. The following criteria are used to minimize area and overlap:
- Minimum sum of the radii of the two MBBs formed.
- Minimum of (sum of the radii of the MBBs - distance between their centers).

The ordering can be obtained in different ways: sorting the vectors lexicographically, using space-filling curves such as Hilbert curves, and others. Given the node N to be split and the performance parameter ff, the minimum percentage of space utilization of a node, two new nodes N1 and N2 have to be created.

Order_Split(N,ff)
Input: Node to be split N and storage utilization ff.
Output: Two nodes N1 and N2.
Step 1: The MBRs of the entries in N are sorted in ascending row-major order of their centers.
Step 2: Two groups of entries, in nodes N1 and N2, each with at least ff storage utilization, are created at a break-point such that the MBRs of the two groups have the least total radius. Ties are broken by the minimum of (sum of the radii of the MBBs - distance between centers).
Step 3: If no two groups with storage utilization ff can be obtained, the entries are sorted by their byte size and Step 2 is repeated.

Splitting by clustering: Given a node N to be split, the goal of the clustering technique is to group the vectors so that similar ones reside in the same MBR.

Cluster_Split(N)
Input: Node to be split N.
Output: Two nodes N1 and N2.


Step 1: The two most dissimilar MBRs are selected from the entries in N, by choosing the two having the smallest common prefix in their centers. Ties are broken by selecting the pair with the largest distance between centers.
Step 2: The selected entries head the two new groups created in nodes N1 and N2.
Step 3: Each remaining entry is added to a group based on the criteria: minimum increase in overlap, minimum decrease in dimensionality, minimum increase in radius and minimum distance from the center, in that order of priority.

Deletion: Deletion involves locating the entry with an exact match query and removing it from its node; the bounding boxes are then updated. When an underflow occurs, the entries in the node are removed and reinserted. When the entries inside a node are redistributed, either by reinsertion or by a split, new MBRs have to be calculated, and extension may be required by introducing new active dimensions, namely those on which all objects agree.

Delete(N,O)
Input: TV-Tree rooted at N and object to be deleted O.
Output: New TV-Tree with O removed.
Step 1: Using the exact match query Search(N,O), the leaf node L containing O is obtained.
Step 2: O is removed from L and L.MBR is updated.
Step 3: If L underflows, all of its entries are removed and its MBR is updated. For each entry E removed from L, Insert(N,E) is invoked.

The TV-Tree is a method for indexing high-dimensional objects. It adapts dynamically and uses the variable number of dimensions needed to distinguish between objects or groups of objects. Since the number of dimensions required is usually small, the method saves space and leads to a larger fanout. The tree is more compact and shallower, requiring fewer disk accesses. The TV-Tree thus achieves access-cost savings and at the same time a reduction in the size of the tree, lowering storage cost.
The savings increase with the dimension, which indicates that the method scales well.
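The split-by-ordering procedure can be sketched as follows. This is a simplified sketch: spheres are (center, radius) pairs, and the group radius is over-estimated from the largest center-to-center distance rather than computed as a true minimum enclosing sphere.

```python
import itertools
import math

def group_radius(spheres):
    # Crude over-estimate of the enclosing radius of a group of spheres:
    # half the largest center distance plus the largest member radius.
    if len(spheres) == 1:
        return spheres[0][1]
    span = max(math.dist(a[0], b[0])
               for a, b in itertools.combinations(spheres, 2))
    return span / 2 + max(r for _, r in spheres)

def order_split(spheres, ff=0.25):
    # Step 1: sort by center (row-major order of coordinates).
    spheres = sorted(spheres, key=lambda s: s[0])
    # Step 2: try every break-point satisfying the fill factor ff and
    # keep the one with the least total radius.
    lo = max(1, math.ceil(ff * len(spheres)))
    best = min(range(lo, len(spheres) - lo + 1),
               key=lambda i: group_radius(spheres[:i]) + group_radius(spheres[i:]))
    return spheres[:best], spheres[best:]
```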

4.1.7 X-Tree
The R-Tree [Gutt84] and its variants [Sell87, Krei90] were designed for the management of spatially extended two-dimensional objects, but can be extended to high-dimensional data. The major drawback of these index structures in high-dimensional data spaces is overlap. In higher dimensions, there is generally only one split axis with relatively good performance, and this axis may lead to unbalanced partitions. In such cases a split should be avoided to prevent underfilled nodes.


The X-Tree [Krei96] is an extension of the R*-Tree [Krei90] motivated by the problems arising in high-dimensional data spaces. It can be seen as a hybrid of a linear array-like structure and a hierarchical R-Tree-like structure. In higher dimensions, when overlap is high, most of the entries in the directory have to be searched; a linear organization is more efficient in these cases, as it needs less space and may be read faster. The X-Tree extends the R*-Tree with two concepts:
- Overlap-free splits according to a split history.
- Supernodes with enlarged page capacity.
The objective is to avoid overlap of bounding boxes through an organization of the tree optimized for high-dimensional space. Splits that would introduce high overlap in the directory are avoided; instead of allowing them, nodes are extended beyond the usual block size and are called supernodes. The basic idea is to keep the tree as hierarchical as possible while avoiding splits that would result in high overlap. Recording the history of data-page splits in an R-Tree yields a binary tree with split dimensions as internal nodes and the current data pages as leaves. Whenever a split would result in an unbalanced tree with underutilized nodes, hurting storage utilization, the X-Tree doesn't split but creates an enlarged directory node instead.
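The split-or-supernode decision can be sketched as follows. This is an illustration, not the paper's algorithm: MAX_OVERLAP is an assumed tuning threshold, and overlap is measured as the fraction of the union volume shared by the two halves of a candidate split.

```python
# Assumed threshold for acceptable overlap between the two split halves.
MAX_OVERLAP = 0.2

def overlap_volume(a, b):
    # a, b: MBRs as per-dimension (lo, hi) intervals; the overlap volume
    # is the product of the per-dimension interval overlaps.
    vol = 1.0
    for (alo, ahi), (blo, bhi) in zip(a, b):
        w = min(ahi, bhi) - max(alo, blo)
        if w <= 0:
            return 0.0
        vol *= w
    return vol

def split_or_supernode(left_mbr, right_mbr, union_volume):
    # If a topological split would leave the halves heavily overlapping,
    # keep the entries together in an enlarged supernode instead.
    if overlap_volume(left_mbr, right_mbr) / union_volume > MAX_OVERLAP:
        return "supernode"
    return "split"
```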

4.1.7.1 X-Tree Vs R*-Tree

The X-Tree outperforms both the R*-Tree and the TV-Tree [Jaga94]. It uses the split history to select the split dimension; if the dimension selected from the split history would result in high overlap, it extends the current node to a larger size, called a supernode. The X-Tree takes less time and needs fewer page accesses for search queries, and its CPU-time advantage grows with the number of dimensions. This is due to the overlap in the R*-Tree, which forces a search of a large portion of the directory. The speedup of the X-Tree over the R*-Tree for queries increases with the dimension. Insertion into the X-Tree is faster than insertion into the TV-Tree or the R*-Tree. Information about the split history has to be stored in the intermediate nodes; the space needed for this is on the order of a few bits.


Figure 7a: X-Tree

Figure 7b: X-Tree Structure

Figure 7a illustrates the structure of the X-Tree for the running example, with M = 4. The objects are bounded by the MBRs R4 to R9 at the intermediate level and by the MBRs R1 to R3 at the root node. The objects bounded by the MBR R7 form a supernode: the five entries [g, D, I, J, h] were assigned to the same node because splitting it would have increased the overlap. In the X-Tree, as the dimensionality increases, the amount of overlap, and hence the number of supernodes, increases.

4.1.7.2 Salient Features

The X-Tree has three types of nodes: data nodes, directory nodes and supernodes. Each entry in a directory node is of the form (MBR, Split-History, Pointer). Supernodes are large intermediate nodes of variable size, a multiple of the usual block size.


Due to the increase in the number of supernodes, the height of the X-Tree decreases as the dimension increases. When no node is a supernode, the X-Tree is completely hierarchical and similar to an R-Tree. When the root is the only supernode, the performance corresponds to a linear directory scan and the size of the directory depends linearly on the dimension.

4.1.7.3 Operations

The algorithms used in the X-Tree are designed to organize the nodes automatically into a hierarchy such that the portions of the data that would produce high overlap are organized linearly, while those that can be organized hierarchically without much overlap are kept in hierarchical form.

Search: The search algorithm is similar to that of the R*-Tree, since only minor changes are required to access the supernodes. It searches the tree to retrieve all the entries overlapping the query region.

Search(N,Q)
Input: X-Tree rooted at N and the query region Q.
Output: All objects in the query region.
Step 1: If N is a non-leaf node, for each entry E whose E.MBR overlaps with Q, Search(E,Q) is invoked.
Step 2: If N is a leaf node, each entry E whose E.MBR overlaps with Q is returned.

Insertion: The insertion algorithm is the most important one, as it determines the X-Tree structure, a suitable combination of hierarchical and linear organization. The main objective is to avoid splits that would produce overlap.

Insert(N,O)
Input: X-Tree rooted at N and object to be inserted O.
Output: New reorganized X-Tree including O.
Step 1: By performing the exact match query Search(N,O), the leaf node L where the object has to be inserted is obtained.
Step 2: O is inserted into L and L.MBR is adjusted.
Step 3: If L does not overflow, the MBRs along the path to the leaf node are updated.
Step 4: Otherwise, Split(L) is invoked to check whether the node can be split.


Step 5: A topological or overlap-minimal split is tried first; if it succeeds, a new node is added to the tree and the tree is updated.
Step 6: If there is no good split, a supernode is created and its entry is updated in its parent node.

Split: For topological splits the X-Tree uses the R*-Tree or other split algorithms. When a node overflows, splitting proceeds as follows. Using topological properties of the MBRs, such as MBR extension and dead space, a split of the node is attempted. If this results in high overlap, the split history stored in the nodes is used: the X-Tree selects the dimension with which the root of the split-history tree was split. This chooses a dimension over which all the data in the subtree have been split, which guarantees overlap-free regions. For lower dimensions there may be more than one overlap-free split dimension, but the probability that a second split dimension exists which is part of the split history of the MBRs of all entries decreases as the dimension increases.

The overlap-free or overlap-minimal split requires that information about the split history be stored in the intermediate nodes, and it can result in an unbalanced tree. In that case it is advantageous not to split at all, rather than to create one underfilled and one almost overfilled node with poor storage utilization. In these cases the X-Tree creates an enlarged directory node called a supernode. The higher the dimensionality, the more supernodes are created and the larger they become. If a supernode is created or extended and there is not enough contiguous space on disk to store it sequentially, the disk manager has to perform a local reorganization, but this does not occur frequently.

Split(N)
Input: Node to be split N.
Output: Reorganized X-Tree.
Step 1: If N is a leaf node, topological splits based on dead space, MBR extension and other properties are used.
Step 2: If N is a non-leaf node, a topological split is tried, resulting in two sets of MBRs.
Step 3: If the topological split creates nodes with high MBR overlap, the split is selected from the split history.
Step 4: For the overlap-minimal split, the root of the split-history tree is selected as the dimension to be split.


Step 5: If the overlap-minimal split results in an underfilled node, the node is not split; instead, the current node is extended to become a supernode of twice the standard block size.

Deletion: The delete operation is a simple modification of the corresponding R*-Tree algorithm. The only difference occurs in the case of an underflow of a supernode: if the supernode consists of two blocks, it is converted to a normal node; if it consists of more than two blocks, its size is reduced. Delete searches for the leaf node holding the entry and removes it. An update operation can be seen as a delete followed by an insert.

Delete(N,O)
Input: X-Tree rooted at N and object to be deleted O.
Output: New reorganized X-Tree.
Step 1: The exact match query Search(N,O) is performed to obtain the leaf node L holding the object O to be deleted.
Step 2: O is removed from L and L.MBR is updated.
Step 3: If L does not underflow, the tree is updated.
Step 4: If L underflows and is not a supernode, its entries are removed, the tree is updated and Insert(N,E) is invoked for each entry E.
Step 5: If L is a supernode that underflows, it is reduced by one block if it has more than two blocks; otherwise it is converted into an ordinary leaf node.
Step 6: The entries in the tree are updated along the path to the leaf node.

R-Tree-based index structures do not perform well for indexing high-dimensional spaces. The X-Tree, an extension of the R*-Tree, introduces the two concepts of supernodes and an overlap-minimal split algorithm. For smaller dimensions the X-Tree behaves like an R-Tree. It shows a high performance gain over the R*-Tree for all query types in medium-dimensional spaces. For higher dimensions, in the presence of overlap, the X-Tree uses a linear scan, which is less expensive. For very high dimensionality the supernodes may become large, which can affect the CPU time of query processing.
It outperforms the R*-Tree and the TV-Tree by up to two orders of magnitude.

4.1.8 PK-Tree
Objects in large image databases are generally described by a vector of numeric attributes. Many indexing techniques exist for such applications, some of which require only simple tree traversal. Quad-tree-based methods such as the PR-quad-tree [Same90] use a disjoint spatial decomposition, while R-Tree [Gutt84] based techniques use


minimum bounding boxes, which may overlap, to divide the space recursively. The pyramid structure is a widely used technique in image processing. The root of the pyramid corresponds to the entire image; the space corresponding to the root is then divided into quadrants, down to the pixel level. It has the following advantages:
- The physical location of a node can be calculated, since the number of nodes at each level is known; this provides fast direct access.
- A node contains summarized information about the area to which it corresponds, which can speed up search.
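The direct-access property can be sketched for a complete 2-D pyramid with quadtree decomposition. This is an illustrative sketch: since every level is full, a node's position in a flat array is a closed-form function of its level and grid coordinates, with no pointers needed.

```python
def pyramid_node_index(level, row, col):
    # Nodes above `level`: 1 + 4 + ... + 4**(level-1) = (4**level - 1) / 3.
    offset = (4 ** level - 1) // 3
    side = 2 ** level          # the level-L grid is side x side cells
    # Row-major index of the cell within its own level, plus the offset.
    return offset + row * side + col
```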

The Pyramid K-instantiable Tree, or PK-Tree, combines aspects of the PR-quad-tree and the KD-Tree [Bent75] while eliminating unnecessary nodes. It achieves a better bound on the height of the tree for skewed data distributions. It instantiates only the non-leaf nodes that have at least a certain number of non-empty children. This restriction prevents the height of the tree from growing large due to skew in the spatial distribution of points. Updates are inexpensive and independent of the order of insertions and deletions: the PK-Tree is unique for any data set.

The PK-Tree is created as follows. Initially the data points lie in a rectangular area, or cell. The rectangle is divided recursively into smaller sub-cells; at each level, the division can be different for each dimension. The higher the level, the smaller the cell size. A simple rule is used to eliminate nodes with too few children.

Each cell can be viewed as a set of points. The rules and definitions used in the creation of the PK-Tree are:
- A cell containing only one point is called a point cell.
- A cell is K-instantiable if either it is a point cell or the data points in the cell cannot be contained in fewer than K K-instantiable sub-cells.
- Every K-instantiable cell is instantiated and becomes a node of the PK-Tree.
- Given the data set, the dividing ratios for all levels, and a value for K, the set of K-instantiable cells is unique.
- The root node, which is the cell at level 0, is always instantiated.
- Every node except the root is mapped one-to-one to a K-instantiable cell.
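A simplified one-level sketch of the K-instantiability rule follows. This is an illustration under strong assumptions: `points` are the data points of a cell, `subcells` is a partition of them one level down, and the recursion of the full definition (the sub-cells must themselves be K-instantiable) is omitted.

```python
def is_k_instantiable(points, subcells, k):
    if len(points) == 1:
        return True                      # a point cell is always instantiable
    nonempty = sum(1 for s in subcells if s)
    # Not instantiable if the points fit in fewer than k sub-cells.
    return nonempty >= k
```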

4.1.8.1.1 PK-Tree Vs PR-quad-tree and KD-Tree


In PR-quad-trees, the height of the tree depends on the minimum Euclidean distance separating two data points rather than on the total number of data points.


Depending on the application, this can lead to inefficient storage or search performance due to the unbalanced tree structure. The KD-Tree removes some unnecessary nodes of the PR-quad-tree, but its height can still be very large for large datasets.

4.1.8.1.2 PK-Tree Vs SR-Tree and X-Tree


The SR-Tree has a larger generation time, as it requires the computation of both a bounding sphere and a bounding rectangle. For the X-Tree, generation time increases with the number of dimensions as the number of supernodes grows. The PK-Tree has a shorter generation time, which increases only slowly with the dimension. For search queries the PK-Tree outperforms both the SR-Tree and the X-Tree, as it has no overlapping siblings. The response time of the X-Tree increases at a slower rate than that of the SR-Tree, while the PK-Tree has the slowest rate of increase with the dimensionality. Because it eliminates non-K-instantiable nodes, the PK-Tree removes the performance impact of skewed data distributions and performs better than the X-Tree and the SR-Tree.

Figure 8a: PK-Tree


Figure 8b: PK-Tree Structure

The PK-Tree structure for the running example is shown in Figure 8a. Here the data space is two-dimensional, the dividing ratios for all levels are r1 = 2 and r2 = 2, and the value of K was set to 3. The data objects are represented by points and labeled with letters. Among the cells at level 2, cells 8 and 13 are 3-instantiable, while cells such as 1, 2 and 4 are not, as the objects contained in them can be covered by one or two sub-cells. At level 1 all the cells A, B, C and D are 3-instantiable; for example, sub-cells 1, 2 and 6 cover the objects in cell A.

4.1.8.2 Salient Features

Given a dataset D of N points, the set of dividing ratios R with r the maximum fan-out factor in the set, and a value for K:
- Any two cells are either disjoint or one is a subset of the other. Cells at the same level are disjoint; cells at different levels are either disjoint or one is a proper subset of the other.
- The children of a non-leaf node are disjoint.
- Each intermediate node has at least K and at most (K-1)*r children.
- The tree has at least N + (N-1)/((K-1)*r - 1) and at most N + (N+K-2)/(K-1) nodes.
- There exists a unique PK-Tree for a given dataset, independent of the order of insertions and deletions.
- The expected height of the PK-Tree is O(log N).
- At each level, a cell is divided along a selected subset of the dimensions.
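The node-count bounds can be evaluated directly. This sketch assumes integer (floor) division for the bracketed fractions, which is a guess about how the terms round:

```python
def pk_node_bounds(N, K, r):
    # Lower and upper bounds on the total node count of a PK-Tree
    # over N points with parameter K and maximum fan-out factor r.
    lower = N + (N - 1) // ((K - 1) * r - 1)
    upper = N + (N + K - 2) // (K - 1)
    return lower, upper
```

For example, with N = 100 points, K = 3 and r = 2, the tree has between 133 and 150 nodes, illustrating that the node count stays O(N).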

4.1.8.3 Operations

Search: Given a location and a range, the search algorithm returns the set of objects within the range distance from the location. The algorithm begins at the root of the PK-Tree and recursively traverses down all the nodes that have a non-zero intersection with the sphere of the given range centered at the location, until the leaf level is reached. If an intermediate node lies entirely within the range, all the points in that node are


retrieved. If h is the height of the tree, the computational complexity of small range queries is O(h); for larger range queries the number of objects returned is linearly proportional to N and the complexity becomes O(N), where N is the cardinality of the data set over which the PK-Tree is built. An exact match query is a special case of a range query with the range set to zero.

Search(N,d,Q)
Input: Root node N, the location d and query range Q.
Output: All data objects within range Q of d.
Step 1: Let Result = NULL.
Step 2: If N is a non-leaf node and Q encloses an entry E in N, Step 4 is performed.
Step 3: If N is a non-leaf node and Q overlaps with an entry E in N, let N = E and repeat from Step 2.
Step 4: If N is a non-leaf node, for each entry E in N, let N = E and repeat from Step 4 (retrieving the whole subtree).
Step 5: If N is a leaf node, Result = Result ∪ N.Data.

Insertion: Inserting an object into a PK-Tree of rank K amounts to inserting the corresponding point cell into the tree. The PK-Tree is generated from an empty tree by inserting the data points one by one. A data point is inserted into the corresponding leaf node in two phases:
- The path from the root to a leaf node is traversed to locate all potential ancestors, and the new data point is inserted.
- Following the same path from the leaf node back to the root, the necessary changes, such as instantiation and de-instantiation, are made.

The average time complexity of insertion is O(log n), and O(n) in the worst case, where n is the number of data objects in the PK-Tree. Since the number of children of each node is bounded, the complexity at each level is constant.

Insert(N,O)
Input: Root node N and object to be inserted O.
Output: Reorganized PK-Tree.
Step 1: If N is a non-leaf node and there exists an entry E such that O is contained in E, Insert(E,O) is invoked.


Step 2: If N is a non-leaf node and O is not contained in any entry, O is added to N and Update(N) is invoked.

The Update algorithm checks whether a node is instantiable and also updates all the nodes along the path to the node.

Update(N):
Input: Node N which has to be updated.
Output: New reorganized PK-Tree.
Step 1: Let C be the set of all children of the node N.
Step 2: If N is not the root node and C has less than K entries, node N is de-instantiated, all of its children are added to its parent node P, and Update(P) is invoked.
Step 3: If there exists a K-instantiable entry E in the node N, Steps 4 to 7 are followed.
Step 4: The sub-cell E is instantiated.
Step 5: E is made a child of N.
Step 6: All entries in C which are contained in E are made its children.
Step 7: Update(N) is invoked.

Deletion: Deletion of a data point is the removal of its point cell from the tree. Like insertion, it is done in two phases: first, the path from the root to the leaf node containing the data to be deleted is traversed and the entry is removed; then, from the leaf back to the root, the necessary changes, such as instantiation and de-instantiation of nodes, are made to maintain the tree.
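The de-instantiation step of Update(N) can be sketched as follows. The Cell class and the rank K = 2 are hypothetical, and the K-instantiation check (Steps 3 to 7) is omitted for brevity; only the promotion of an under-full node's children into its parent is shown.

```python
K = 2  # assumed rank of the PK-Tree

class Cell:
    """Hypothetical PK-Tree entry: a named cell with child entries."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children if children is not None else []

def update(node, parent=None):
    """De-instantiate `node` if it has fewer than K children (and is not
    the root), promoting its children into the parent, as in Step 2 of
    Update(N). The K-instantiation of sub-cells is not modeled here."""
    if parent is not None and len(node.children) < K:
        parent.children.remove(node)
        parent.children.extend(node.children)
        update(parent)   # the parent may in turn become under-full
```

A node with a single child is removed and its child takes its place under the parent, which keeps every instantiated node at least K-way as the constraints require.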

The deletion algorithm first traverses from the root to the leaf node containing the data to be deleted. At each level, the entry which contains the object is selected. When the leaf node is reached, the object is removed from it. The update steps given above are then followed to instantiate and de-instantiate nodes along the path. The computational complexity of deleting an object is O(log N) on average and O(N) in the worst case.

Delete(N,O):
Input: PK-Tree rooted at N and object to be deleted O.


Output: Reorganized PK-Tree with O removed.
Step 1: If there exists an entry E such that O is contained in E, Step 2 is followed.
Step 2: If E is a leaf node, E is de-instantiated and Update(N) is invoked.
Step 3: If E is a non-leaf node, Delete(E,O) is invoked.

The PK-Tree is a variation of the PR-quadtree. It differs from existing index structures by employing a unique set of constraints to eliminate the unnecessary nodes that can result from skewed data distributions. The total number of nodes in a PK-Tree is O(N), and the average height of the tree is O(log N). It improves creation time and query time compared to existing spatial index structures, and performs well for uniformly distributed data and for most skewed data distributions. Its properties include non-overlapping sibling nodes, uniqueness of the PK-Tree for a given data set independent of the order of insertions or deletions, and a bounded number of children per node. The PK-Tree outperforms existing spatial index methods like the SR-Tree and the X-Tree, which are based on the R-Tree.

4.1.9 G-Tree
The Grid-Tree or G-Tree [Kuma94] is a multidimensional point access method which combines the features of Grid Files [Sevc94] and B-Trees. It divides the multidimensional space into a grid of variable-size partitions, and the partitions are organized into a B-Tree. It orders and numbers the partitions in such a way that partitions that are spatially close together are also close in terms of their partition numbers. It adapts well to high frequencies of insertions and deletions and to non-uniform distributions of data. The structure is similar to the BD-Tree [Osha83]. The partitions correspond to disk pages, and points are assigned to a partition until it is full. A full partition is split into two equal sub-partitions, and the points in the original partition are distributed among the new partitions. These completely ordered partitions are stored in a B-Tree. Like the K-D-B-Tree [Robi81], the G-Tree is a balanced index structure that divides the data space into a set of non-overlapping regions.

4.1.9.1.1 G-Tree Vs BD-Tree


The BD-Tree, which has the same partition numbering as the G-Tree, can exclude some sub-partitions from a given partition, while the G-Tree maps a partition directly to a node. Though the BD-Tree guarantees storage utilization of at least 50%, it can become highly unbalanced, whereas the G-Tree is a balanced structure. The BD-Tree, a binary search tree, is a main-memory structure, while the G-Tree is based on the B-Tree, which is more suitable for storage on disk. The G-Tree indexes the partitions in a better way for efficient access.


4.1.9.1.2 G-Tree Vs KDB-Tree


The fan-out of the G-Tree is about 2.5 times as large as the fan-out of a KDB-Tree in the two-dimensional case, and the advantage increases for higher dimensions. In the G-Tree, when a partition becomes full it is split into two equal sub-partitions, whereas in the KDB-Tree the partitions may be unequal. The KDB-Tree suffers forced splitting of children nodes when the parent nodes are split. The splitting method of the KDB-Tree also affects the deletion algorithm, which requires reorganizations and often deteriorates storage utilization. In the G-Tree, the partitioning becomes identical regardless of the order of insertions and deletions.

4.1.9.2 Salient Features

Partitions are numbered as binary strings of 0s and 1s. Initially the entire data space is divided along the first dimension into two equal sub-partitions, numbered 0 and 1. When partition 0 becomes full, it is subdivided to create two partitions, 00 and 01, of equal size. For d dimensions, the splitting dimension cycles with a periodicity of d, so that each dimension appears once in a cycle. A leaf node points to a page that contains all the points in a partition, while higher-level nodes point to nodes at the next lower level. The most significant bits of partition numbers are used to compare two partitions and determine the relationship between them. The data space is divided into non-overlapping regions of variable size, and each dimension has values within a specified range. Each region or partition has a maximum of M entries. The partition numbers are of variable length, each one as long as necessary, and each partition is assigned a unique partition number. A total ordering is defined on the partition numbers and they are stored in a B-Tree; empty partitions are not stored in the tree, to save space. The complement of a partition is obtained by inverting the last bit of its number. The partitions in a G-Tree do not overlap and are ordered by a greater-than relation.
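Under these numbering rules, a partition number can be decoded into its region bounds. A minimal sketch (the function names are ours; bit 0 is assumed to select the lower half of the split dimension, matching the left-side-begins-with-0 labeling described for Figure 9a):

```python
def partition_bounds(code, lows, highs):
    """Decode a G-Tree partition number (a string of '0'/'1') into the
    bounds of its region; the split dimension cycles with period d."""
    d = len(lows)
    lo, hi = list(lows), list(highs)
    for i, bit in enumerate(code):
        j = i % d                    # dimension split by this bit
        mid = (lo[j] + hi[j]) / 2
        if bit == '0':
            hi[j] = mid              # 0 assumed to select the lower half
        else:
            lo[j] = mid              # 1 assumed to select the upper half
    return lo, hi

def complement(code):
    """Complement partition: invert the last bit of the number."""
    return code[:-1] + ('0' if code[-1] == '1' else '1')
```

For the unit square, "01" decodes to the left half in x (first bit 0) and the upper half in y (second bit 1), which is the region labeled 01 in the running example; "01" and "00" are complements of each other and are the candidates for merging on deletion.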


Figure 9a: G-Tree

Figure 9b: G-Tree Structure

The data space partitioning of the running example for the G-Tree is illustrated in Figure 9a. The data objects are labeled with letters, and each region is labeled by the binary string corresponding to its location. The non-empty regions are stored in a B-Tree. The regions on the left side of the first division have numbers beginning with 0, while those on the right begin with 1. In the left half of Figure 9a, the regions above the horizontal division are labeled beginning with 01, and those below with 00.

4.1.9.3 Operations

Search: To search a given query region, the smallest and largest partition numbers that could overlap with the region are first determined. All the partitions in this range may contain points that overlap with the query region. If a partition is fully contained, all the points in it satisfy the query; if it overlaps, each point in it must be checked individually; and if it lies outside the query region, it need not be searched further. The search algorithm first transforms the leftmost and rightmost points of the query region into b-bit partition numbers Pl and Pr respectively. The G-Tree is then searched and all partitions in the range Pl to Pr are checked. If a partition is fully contained in the query region, all the points in the partition are returned. If a partition overlaps the query region, its points are examined one by one and those which lie within the query region are returned. If a partition neither overlaps nor is contained, the next partition is checked, and this continues until all the points in the query region are recovered.

Search(N,Q):
Input: N, the root node of the G-Tree, and the query region Q with leftmost point ql and rightmost point qr.
Output: All objects in the query region.
Step 1: Let Pl = Transform(ql) and Pr = Transform(qr).
Step 2: Let P be the smallest partition in the G-Tree.
Step 3: Until P > Pl, P = Pnext is performed to search for Pl in the G-Tree.
Step 4: For all partitions in the range Pl to Pr, Steps 5 to 8 are performed.
Step 5: Let O = Overlap(P), to check whether the partition is contained in or intersects the query region.
Step 6: If P is contained in the query region, all entries in P are retrieved.
Step 7: If P overlaps the query region, each entry in P which intersects Q is retrieved.
Step 8: P = Pnext, to get the next partition in the G-Tree.

The function Overlap transforms the partition number into the coordinates of its leftmost and rightmost points, and then two tests are performed. The first test checks whether both the leftmost and rightmost points lie in the query region in all dimensions; if so, the partition is fully contained within the region and the value 1 is returned. If the first test fails, the second test checks for an overlap between the partition and the query region and returns the value 2 if successful.
Overlap(P):
Input: A partition P which has to be checked for overlap.
Output: 1 if P is contained in the query region, 2 if P intersects the query region.
Step 1: Let Val = 0 and b = the number of bits in P.
Step 2: For i ranging from 1 to b/n, Step 3 is performed.
Step 3: For j ranging from 1 to n, Steps 4 and 5 are performed.
Step 4: Let p = n * (i-1) + j.
Step 5: If p ≤ b and the bit in position p is 0, then qr.xj = (qr.xj + ql.xj)/2; otherwise ql.xj = (qr.xj + ql.xj)/2.
Step 6: Let j = 1. While (j ≤ n and ql.xj ≥ lj and qr.xj ≤ hj), j = j+1 is performed.
Step 7: If j == n+1, then Val = 1, as partition P is contained in the region, and Val is returned; otherwise Step 8 is performed.
Step 8: Let j = 1. While (j ≤ n and ql.xj ≤ hj and qr.xj ≥ lj), j = j+1 is performed.
Step 9: If j == n+1, then Val = 2, as partition P intersects the region, and Val is returned.

Insertion: Insertion first assigns a partition number and then searches the tree to locate the partition where the point has to be inserted. If such a partition exists, the point is inserted into it; in case of an overflow, the partition is split into two equal sub-partitions. This splitting is repeated until no single partition holds all the points. If no partition exists, a new partition is created and added to the tree. The insertion algorithm initially computes the partition number p for the given object, with the number of bits equal to that of the smallest partition created so far. It then searches for a partition which contains the partition number p. If such a partition exists, the object is inserted into it. If no partition is available, a new partition is created, either the partition itself or its largest ancestor which does not overlap with the existing partitions in the G-Tree. If the partition overflows, two new partitions are created and the points are reallocated between them. If one of the partitions is empty, it is not inserted into the tree; otherwise both partitions are inserted. The splitting is repeated until the points are distributed across more than one partition.

Insert(N,O):
Input: Root node N and the object to be inserted O.
Output: Reorganized G-Tree with O included.
Step 1: Let P = Transform(O) be the partition number of the object.
Step 2: An exact-match query, Search(N,O), is invoked to get the actual partition Pa which contains P.
Step 3: O is inserted in Pa.


Step 4: If Pa overflows, two new partitions P0 and P1 are created by appending 0 and 1 respectively to Pa.
Step 5: Pa is deleted from the G-Tree.
Step 6: The objects in Pa are reallocated to P0 and P1.
Step 7: Insert(N,P0) and Insert(N,P1) are invoked.

Deletion: Like the insertion algorithm, the deletion algorithm first computes a partition number p with the number of bits equal to that of the smallest partition in the G-Tree. The G-Tree is then searched to find a partition which contains the partition number. If no partition is found, the given point does not exist. If a partition is found, the given point is searched for and deleted from the partition. If the partition can be merged with its complement partition, the entries of the two partitions are removed, all the entries are placed into one node, and the new parent node is inserted.

Delete(N,O):
Input: Root node N and the object to be deleted O.
Output: Reorganized G-Tree with O deleted.
Step 1: Let Pd = Transform(O).
Step 2: Search(N,O) is performed to get the partition P which contains Pd.
Step 3: O is removed from P; let Pa = P.
Step 4: While the total number of points in Pa and the complement partition of Pa is less than the maximum number of entries allowed, Steps 5 to 8 are performed.
Step 5: The objects in Pa and the complement of Pa are reassigned to the parent Pt.
Step 6: Pa and the complement of Pa are removed from the G-Tree.
Step 7: Insert(N,Pt) is invoked to insert the parent node into the G-Tree.
Step 8: Let Pa = Pt.

The G-Tree is an index structure based on organizing variable-sized grid partitions into a B-Tree. It has an ordering property such that partitions which are close in the multidimensional space are also close in terms of their partition numbers. The structure adapts well to dynamic data spaces with high frequencies of insertions and deletions, suits non-uniform data, and is independent of the order of insertion. It performs better than the BD-Tree, which has a similar numbering scheme.
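The containment and intersection tests performed by the Overlap function reduce to per-dimension interval checks once the partition has been decoded into its corner points. A minimal sketch, with partitions and query regions given as plain corner-point lists (closed intervals assumed; the function name is ours):

```python
def classify(plo, phi, qlo, qhi):
    """Classify partition [plo, phi] against query region [qlo, qhi]:
    1 = fully contained, 2 = partial overlap, 0 = disjoint.
    Mirrors the two tests of the Overlap function."""
    n = len(plo)
    # First test: every partition corner lies inside the query region.
    if all(qlo[j] <= plo[j] and phi[j] <= qhi[j] for j in range(n)):
        return 1
    # Second test: the intervals intersect in every dimension.
    if all(plo[j] <= qhi[j] and phi[j] >= qlo[j] for j in range(n)):
        return 2
    return 0
```

In the search loop, result 1 lets the algorithm return every entry of the partition without point-by-point checks, result 2 forces individual checks, and result 0 skips the partition entirely.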


4.1.10 MB+-Tree
The Multidimensional B+-Tree or MB+-Tree [Yang95] can be considered an extension of the B+-Tree to multiple dimensions. The B+-Tree stores data values in its leaves and copies them into internal nodes when necessary. The MB+-Tree takes the characteristics of image and video database management systems employing content-based retrieval into account better than previous methods. To retrieve an image based on its content, it is necessary to extract the features which are characteristic of the image and index the image on these features. A feature vector can be represented as a multidimensional vector, with each component denoting a different feature measure. The MB+-Tree has similarities to and differences from the R-Tree [Gutt84] and its variants.

4.1.10.1.1 MB+-Tree Vs B+-Tree

Insertion and deletion in the MB+-Tree are extensions of the corresponding B+-Tree algorithms to higher dimensions. The search methods in the MB+-Tree are different, as it tackles similarity queries.

4.1.10.1.2 MB+-Tree Vs R-Tree and its variants

Unlike other multidimensional index structures, the MB+-Tree uses a linear ordering for indexing the multidimensional space. For the R-Tree and its variants, searching an intermediate node requires examining all entries in the node, and examining an entry requires comparing the boundary values in all dimensions. For the MB+-Tree, some entries in the node may not need to be examined and the search can return at an intermediate step; furthermore, examining an entry may not require comparing all dimensions. Though the worst-case time complexity of search is the same as for the R-Tree and its variants, in practice the MB+-Tree can give better performance, since some entries, and some dimensions of some entries, may not be examined. This is an advantage of using the linear ordering.


Figure 10a: MB+-Tree

Figure 10b: MB+-Tree Structure

Figure 10a illustrates the two-dimensional data space partitioning of the running example for the MB+-Tree. The data objects are labeled with letters. As the data regions are stored in a B+-Tree, they are labeled using the order of the dimensions: the first number of a region gives its position in the first dimension, the second number its position in the second dimension, and so on. These labels are used to order the data regions. After being ordered linearly, the data regions are stored in a B+-Tree. This allows search algorithms to move directly from one region to the next by using the linked list of leaves.

4.1.10.2 Salient Features


For a d-dimensional space, the MB+-Tree first partitions the data space along the first dimension. As partitions become full, they are split. The splitting is initially done along the first dimension, as long as the resulting regions have a width above a threshold value. The region is then divided independently along the second dimension. As new objects are inserted into the MB+-Tree, a region can be split repeatedly along a dimension while its width remains above the threshold; after that, the split moves to the next dimension. A linear order can be defined on the set of all regions by comparing the boundary values in the same order as the dimensions, and a B+-Tree can be built using this order. In the MB+-Tree, the partition can occur anywhere along a dimension. If a large number of data points are clustered in a region, the MB+-Tree can split the objects evenly. This decreases the height, and hence the search time, of the tree. The values used for splitting are stored at each level. For example, for two dimensions the structure of the MB+-Tree is as follows: (x,y) represents an attribute in the two-dimensional space. The space is partitioned into M vertical strips by M-1 vertical partitions. Each vertical strip S is then partitioned into Ns regions by (Ns - 1) horizontal lines; the value of Ns can be different for different strips. Thus the two-dimensional space is partitioned into a set of disjoint regions, each of which is a rectangle. The horizontal dimension is called the first dimension and the vertical dimension the second dimension. The M vertical strips are ordered from left to right, and the Ns horizontal regions within a vertical strip S from bottom to top. This yields a linear order on the set of all regions, and a B+-Tree is built on the set of all regions using this order.
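Because the linear order compares boundary values dimension by dimension, sorting regions by their lower-left corners reproduces it in the two-dimensional case: strips go left to right, and regions within a strip go bottom to top. A small illustrative sketch (the region representation, a lower-left corner tuple, is ours):

```python
# Four regions of the unit square, given by their lower-left corners
# (x_low, y_low) — the same representation the leaf level stores.
regions = [(0.5, 0.0), (0.0, 0.5), (0.0, 0.0), (0.5, 0.5)]

# Lexicographic tuple comparison is exactly "first dimension, then
# second dimension", so plain sorting yields the MB+-Tree linear order.
ordered = sorted(regions)
```

The sorted sequence visits the left strip bottom-to-top before the right strip, which is the order in which the doubly-linked leaf list chains the regions.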

The properties of the MB+-Tree are: Each internal node has M entries and M+1 pointers; each entry represents a region and each pointer points to a child node, which is the root of a sub-tree. All leaf nodes are linked together by a doubly-linked list. Each leaf node contains one or more leaf entries representing regions, and each region appears exactly once at the leaf level. All regions are ordered from left to right according to their linear order. The rightmost entry of every leaf node except the last will appear in the internal node on the path from the root to that node. In the non-leaf nodes, a rectangle is represented by two points, its lower-left and upper-right. In the leaf nodes, a rectangle is represented only by its lower-left point, which increases the fan-out of the leaf nodes. Each leaf entry is of the form (lower-left-point, pointer-to-list), where the pointer points to a list of all the entries in the rectangle, each of the form (attribute, address).


The linear ordering has the following advantages: The space required for each entry at the leaf level is reduced by nearly half; this reduces the number of leaf nodes, and hence the size of the MB+-Tree, resulting in better search performance. The insertion and deletion algorithms are similar to those of the B+-Tree and are simpler than those of other trees like the R-Tree and its variants. The entries in the intermediate levels correspond to elements of the set, as do the entries at the leaf level. The MB+-Tree maintains locality to some extent, which should give better performance in searching.

4.1.10.3 Operations
Initially the MB+-Tree has only one leaf node, which is also the root node, with only one entry. Each inserted object is simply added to the list until it is full. A splitting operation is then required for the next insertion, and the entire space is divided into two along the first dimension. The process continues until the space is divided into smaller regions; splitting then proceeds along the second dimension, and so on.

Search: Given a query region, all the objects which belong to the query region have to be retrieved. Each leaf node is scanned for overlap, and for the overlapping leaf nodes, the entries in the list pointed to by each pointer are scanned to locate the required objects. The search algorithm first finds the leaf nodes and then goes through all the entries of each leaf node. The entries in the internal nodes are scanned to find all the sub-trees that contain at least one leaf entry overlapping with the query region; when more than one sub-tree overlaps with the query region, all such entries are searched recursively. As the rectangle in an entry is not an enclosing rectangle, the algorithm has two loops, because the condition for finding the first sub-tree differs from that for identifying the following sub-trees. For finding the first sub-tree the condition is: ((Q precedes the MBR on the first dimension) or (Q overlaps with the MBR on the first dimension and precedes or overlaps it on the second dimension)). For finding the subsequent nodes the condition used is: NOT(Q follows the MBR on the first dimension or (Q overlaps with the MBR on the first dimension and follows it on the second dimension)). Leaf nodes store only the lower-left corners of the rectangles, but checking overlap also requires the upper-right corners. For every leaf node except the last, the last entry in the node appears in an internal node on the path from the root to the leaf node. If all nodes on this path are maintained in memory, the upper-right corner of an entry can be found using the upper-right corner of the rightmost entry, since all the leaf entries are sorted in the linear order.
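The two loop conditions can be expressed as interval predicates. A sketch with rectangles given as ((x_lo, x_hi), (y_lo, y_hi)) pairs; this representation and the helper names are ours, and closed intervals are assumed:

```python
def precedes(a, b):
    """Interval a lies entirely before interval b."""
    return a[1] < b[0]

def overlaps(a, b):
    """Intervals a and b intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def follows(a, b):
    """Interval a lies entirely after interval b."""
    return a[0] > b[1]

def is_first_subtree(q, mbr):
    """Condition for the first sub-tree that may contain results:
    Q precedes the MBR on dim 1, or overlaps it on dim 1 and
    precedes or overlaps it on dim 2."""
    return (precedes(q[0], mbr[0]) or
            (overlaps(q[0], mbr[0]) and
             (precedes(q[1], mbr[1]) or overlaps(q[1], mbr[1]))))

def is_candidate(q, mbr):
    """Condition for subsequent sub-trees: NOT(Q follows the MBR on
    dim 1, or overlaps it on dim 1 and follows it on dim 2)."""
    return not (follows(q[0], mbr[0]) or
                (overlaps(q[0], mbr[0]) and follows(q[1], mbr[1])))
```

Because the regions are linearly ordered, the scan over a node's entries can start at the first entry satisfying is_first_subtree and stop at the first entry failing is_candidate, which is how some entries escape examination.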


Search(N,Q):
Input: Root node of the MB+-Tree N and the query region Q.
Output: Objects in the query region Q.
Step 1: If N is a leaf node, N is returned.
Step 2: Among the M entries, the first entry E that overlaps with Q is determined and added to a list NODE.
Step 3: From the following entries, the nodes that have at least one entry overlapping with the query region are determined and their entries are added to NODE.
Step 4: Let S = NULL; for each entry E in NODE, S = S ∪ Search(E,Q) is performed.
Step 5: Let RESULT = NULL. For each entry N in S, Steps 6 and 7 are followed.
Step 6: Starting from the rightmost entry, the upper-right corner of each entry is determined.
Step 7: If the entry overlaps with the query region, it is added to the resultant set.
Step 8: The resultant set is returned.

For the R-Tree and its variants, searching a node requires searching all of its entries. For the MB+-Tree, the search can stop at an intermediate step: some entries in the node need not be examined, and examining an entry may not require comparing all dimensions.

Insertion: Using the linear order, insertion is a standard B+-Tree operation. If there is an overflow, the region is first split along the first dimension, then along the second dimension, and so on. If the region is a strip in a particular dimension, the split can be done along the same dimension until the sub-region reaches a minimum size; the splitting then proceeds along the remaining dimensions.

Insert(N,O):
Input: MB+-Tree rooted at N and object to be inserted O.
Output: Reorganized MB+-Tree.
Step 1: Using an exact-match query, Search(N,O), the leaf node N where the object has to be inserted is determined.
Step 2: O is added to N.
Step 3: If there is an overflow in N, a split is performed as given in Steps 4 to 6.


Step 4: During the split, the list is divided into two lists of about the same size.
Step 5: If the region to be split is a vertical strip that has not been divided by a horizontal line, it is divided by another vertical line, by choosing a value for the minimum length of the horizontal side of the vertical strip.
Step 6: If the vertical dividing would result in a thin vertical strip, with a horizontal side smaller than the minimal value, horizontal dividing is done.
Step 7: After a region has been split into two smaller regions, a new entry is inserted into the tree.

Deletion: Deletion is similar to the standard B+-Tree deletion. It may be required when a list becomes too small.

Delete(N,O):
Input: MB+-Tree rooted at N and object to be deleted O.
Output: Reorganized MB+-Tree.
Step 1: Using an exact-match query, Search(N,O), the leaf node N which has the entry to be deleted is determined.
Step 2: O is removed from N.
Step 3: If the list in N becomes too small, it is merged with another list.
Step 4: Two neighboring regions split along the same dimension can be merged; for example, a vertical strip not divided by a horizontal line can be merged with another such vertical strip.
Step 5: After the two lists are merged, one leaf entry is deleted from the tree.

Index structures to be used for content-based retrieval in multimedia databases need to take the types of queries into account. The MB+-Tree was created mainly for similarity-based queries and uses a linear-order-based approach. It has the following useful features: it supports nearest-neighbor queries efficiently, supports different similarity measures, and requires less space for leaf-level entries.
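The split choice of Steps 4 to 6 of the insertion algorithm can be sketched as follows. The threshold value and the use of the spatial midpoint are illustrative only; the tree actually picks a dividing value so that the two resulting lists hold about the same number of entries.

```python
MIN_WIDTH = 0.25   # assumed minimum horizontal side of a vertical strip

def split_region(x_lo, x_hi, y_lo, y_hi, divided_horizontally):
    """Choose the split for an overflowing MB+-Tree region: keep
    splitting an undivided vertical strip by vertical lines while the
    halves stay wide enough, then switch to horizontal splits."""
    if not divided_horizontally and (x_hi - x_lo) / 2 >= MIN_WIDTH:
        mid = (x_lo + x_hi) / 2    # vertical dividing line (illustrative)
        return ('vertical', mid)
    mid = (y_lo + y_hi) / 2        # horizontal dividing line
    return ('horizontal', mid)
```

A wide, undivided strip keeps splitting vertically; once the halves would fall below the minimum width, the region switches to horizontal splits, matching the dimension-by-dimension refinement described above.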

4.1.11 hB-Tree
An ideal index structure must have the following properties: good average storage utilization in both intermediate and leaf nodes; high fan-out, which results in a small number of disk accesses; easy incremental reorganization as the file grows; simple algorithms; and the ability to handle different kinds of queries.

Many index structures exhibit these properties only some of the time. The K-D-B-Tree [Robi81] was proposed as a multidimensional access method. It has the following desirable properties: it is a balanced tree structure; all leaf nodes are at the same level; the data is stored in the leaf nodes while the intermediate nodes contain indexes which direct the search; and it adapts to the distribution of attribute values.

The Holey-Brick Tree or hB-Tree [Lome90] is derived from the K-D-B-Tree but has additional desired properties. It uses k-d-trees [Betn75] to organize the space represented by the interior nodes, for very efficient searching. The advantages of using k-d-trees are: compared to a boundary representation, the regions in a k-d-tree share boundaries, making intra-node search of the space efficient; the k-d-tree requires fewer comparisons during searching than a boundary representation; and it uses less space than the boundary-list representation.

The nodes in an hB-Tree represent bricks from which smaller bricks have been removed. To minimize redundancy, the k-d-tree corresponding to an interior node can have several leaves pointing to the same node. The hB-Tree grows from the leaves and, like the B-Tree, has all its leaves at the same level. It solves the node-splitting problem of the K-D-B-Tree by using more than one attribute value while splitting.

4.1.11.1.1 hB-Tree Vs K-D-B-Tree

The index nodes of the hB-Tree are organized as k-d-trees, and the splitting of a node is done based on multiple attributes. Nodes do not correspond to d-dimensional intervals, but to intervals from which smaller intervals have been removed. The cascading of splits, which is the drawback of the K-D-B-Tree, is avoided in the hB-Tree. The hB-Tree is not strictly a tree, but a directed acyclic graph. The K-D-B-Tree does not specify how to organize the data within its nodes.

4.1.11.1.2 hB-Tree Vs B+-Tree

The insertion algorithm differs from the B+-Tree in the following details:


1. The organization and splitting of data and index nodes.
2. The posting of index terms to the next higher level.
3. The guaranteed storage utilization.

The hB-Tree requires more space to store index terms than the B+-Tree and so has a smaller fan-out. Otherwise, the hB-Tree differs from the B+-Tree only in the organization of index terms into k-d-trees and in the splitting of data between nodes.

Figure 11a: hB-Tree

Figure 11b: hB-Tree Structure


In the hB-Tree, partitioning is done in more than one dimension. Figure 11a illustrates this property of the hB-Tree over the two-dimensional running example. The root node has entries for two child regions, R and S, partitioned by X1 along the X-axis and Y1 along the Y-axis. Region R has two parts: one containing the subregions 1 and 2, and a second containing the subregions 5 and 6. The entry EXT in region R represents the portion removed from R. Region S contains the subregions 3 and 4.

4.1.11.2 Salient Features


The hB-Tree is a derivation of the K-D-B-Tree in which k-d-trees are used to represent the intermediate nodes. Node splitting is done based on more than one attribute. Nodes represent regions from which smaller regions may have been removed; the holey region is called the enclosing region and the region removed is called the extracted region. As several leaves of the k-d-tree can refer to the same node at the lower level, the hB-Tree is not truly a tree, but a directed acyclic graph. During node splitting, if f is the fraction of data going into the new node and (1-f) the fraction remaining in the original node, then the storage utilization U is given by: U = f * log(1/f) + (1-f) * log(1/(1-f)). If s is the size of a node and i is the size of an index term, the fan-out is (U * s) / i.
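The utilization and fan-out formulas can be evaluated directly; base-2 logarithms are assumed here, since the source does not state the base.

```python
import math

def utilization(f):
    """Storage utilization when a fraction f of the data goes to the new
    node during a split: U = f*log2(1/f) + (1-f)*log2(1/(1-f))."""
    return f * math.log2(1 / f) + (1 - f) * math.log2(1 / (1 - f))

def fan_out(u, node_size, index_term_size):
    """Fan-out = (U * s) / i for node size s and index-term size i."""
    return (u * node_size) / index_term_size
```

For an even split (f = 0.5) the formula gives U = 1, its maximum; at the one-third/two-thirds boundary that the split algorithm guarantees, utilization is still above 0.9, which is why the bounded split ratio matters.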

4.1.11.3 Operations
The k-d-tree leaves in an hB-Tree index node refer to lower-level hB-Tree nodes. Each internal node of the k-d-tree contains an indicator of which attribute is compared, what the comparison value is, and whether equality falls on the left branch or the right branch.

Search: Searching is the same as in a binary tree. For exact-match queries, the number of nodes accessed is exactly the height of the tree. Range queries may require several nodes of the hB-Tree to be accessed at each level.

Search(N,Q):
Input: Root node N of the hB-Tree and the query region Q.
Output: Objects in the query region.
Step 1: If N is a leaf node, the objects which are in Q are retrieved from the node.
Step 2: If N is a non-leaf node, Steps 3 to 5 are followed.
Step 3: If the comparison attribute value in a k-d-tree node is greater than the values in the query range, Search(N.L,Q) is invoked.
Step 4: If the attribute value is smaller than those in Q, Search(N.R,Q) is invoked.

Step 5: If the comparison value is in the middle of the search range, both Search(N.L,Q) and Search(N.R,Q) are invoked.

Insertion: An exact-match search finds the node where the data has to be inserted, using the k-d-trees in the hB-Tree index nodes. Within this node, the location where the new object has to be inserted is determined. If the node overflows after the insertion, it is split.

Insert(N,O):
Input: Root node of the hB-Tree N and object to be inserted O.
Output: Reorganized hB-Tree with O inserted.
Step 1: The exact-match query Search(N,O) is performed to get the node N where O has to be inserted.
Step 2: O is inserted into N, if N has sufficient space.
Step 3: If N overflows, Steps 4 to 6 are followed for splitting.
Step 4: A new node is created and the data of the original node is split between the original and the new node.
Step 5: An index term which identifies the new node is posted at the next higher level. If this causes an overflow in the higher-level index node, that node is split.
Step 6: After the restructuring of the tree, the new data is inserted into the tree.

Split: When splitting the entries of a data node, splitting along one dimension will not always divide the data evenly. When an index node is split, information is posted in the parent to distinguish between the enclosing tree and the extracted tree. While splitting an internal node, the requirement is that both nodes contain between one-third and two-thirds of the data. To reflect the absence of an extracted region, the hB-Tree node is assigned a marker which indicates that the region is no longer a simple interval; by this technique, splits need not be propagated downwards. If no median hyperplane splits the space in a ratio less than or equal to 2:1, and if some d-dimensional closed upper corner D contains more than 2/3 of the points, then any (d+1)-dimensional closed upper corner contained in D has more than 1/3 of the points.
The leaf nodes of the internal k-d-tree can be one of three kinds: they can refer to a collection of data records, refer to other hB-Tree nodes, or indicate that part of the tree has been extracted.
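The range-search recursion over a k-d-tree index node can be sketched as follows. The node layout (one comparison attribute and value per internal node, payloads at the leaves) is a simplified stand-in for the hB-Tree's intra-node organization; the equality-side indicator and extraction markers are omitted.

```python
class KDNode:
    """Simplified k-d-tree node: internal nodes compare attribute `dim`
    against `value`; leaves hold a payload (e.g. a child node reference)."""
    def __init__(self, dim=None, value=None, left=None, right=None, leaf=None):
        self.dim, self.value = dim, value
        self.left, self.right, self.leaf = left, right, leaf

def search(node, lo, hi, out):
    """Collect the leaves whose subspace may intersect the query box
    [lo, hi]. Both subtrees are visited when the comparison value falls
    inside the query range on that dimension (Step 5 above)."""
    if node.leaf is not None:
        out.append(node.leaf)
        return
    if node.value > lo[node.dim]:     # query extends below the split
        search(node.left, lo, hi, out)
    if node.value <= hi[node.dim]:    # query extends above the split
        search(node.right, lo, hi, out)
```

When the query box straddles the comparison value, both branches are taken, which is why range queries may touch several hB-Tree nodes at each level.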


The following cases are considered while splitting an internal node. If the root of the k-d-tree in the internal node can be used to split the node such that both new nodes hold between one-third and two-thirds of the data, it is used. If the root cannot be used and one sub-tree has more than two-thirds of the data, that sub-tree is traversed recursively until either a leaf node is reached or a node is reached at which one of the sub-trees satisfies the requirement; if such a sub-tree is found, it is removed from the tree, moved to a new node, and both nodes are updated. If a leaf node of the k-d-tree has to be split, a corner containing between one-third and two-thirds of the data is removed.

Split(N): Input: The internal hB-Tree node N to be split. Output: Two new nodes N and N′.
Step 1: If the root of the k-d-tree can be used for the split, Steps 2 to 4 are followed.
Step 2: The right sub-tree is extracted.
Step 3: The original node is updated to represent the new boundaries.
Step 4: The extracted sub-tree forms the new node.
Step 5: If the root cannot be used, then starting at the root of the k-d-tree in the internal node, the sub-tree containing more than two-thirds of the data is traversed until either a leaf node is reached or a sub-tree holding between one-third and two-thirds of the data is found.
Step 6: If a leaf node is reached, a sub-tree consisting of either a partition or a corner is removed from the tree as follows: the attributes are first ordered; the median values of all attributes are checked to see whether they provide the desired split; if no single attribute suffices, corners formed from consecutive pairs of attributes are examined at their upper and lower corners. Following this procedure a partition is obtained. The sub-tree is removed, its entry in the original node is replaced by a marker, and the original node is updated.
Step 7: The extracted sub-tree is moved to a new node and the tree is updated. Once an internal node is split, the information is posted in the parent node to distinguish between the enclosed and the extracted nodes.

Deletion: Deletions that do not yield underutilized nodes do not change the hB-Tree structure; they are done locally within a node. If a leaf node is underutilized, siblings that were previously enclosing and extracted nodes can be combined. This is complicated, and different cases must be considered and treated separately. Because the hB-Tree deals with holey bricks, it is always possible to find another node in which to store the objects of a node to be deleted.

The hB-Tree is derived from the K-D-B-Tree, with internal nodes represented by k-d-trees. In the hB-Tree a split can occur in more than one dimension. Some features, such as redundant references to child nodes, can cause problems, and the deletion algorithm may be complicated. Analyses of the space utilization and fan-out have been derived. The structure adjusts to any pattern of incoming data.
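The split search described above (follow the subtree holding more than two-thirds of the entries until a subtree with between one-third and two-thirds is found, or a leaf is reached) can be sketched as follows. The `KdNode` class and its `count` field are illustrative assumptions, not the paper's data structure:

```python
class KdNode:
    def __init__(self, left=None, right=None, count=1):
        self.left = left      # left subtree, or None for a leaf
        self.right = right    # right subtree
        self.count = count    # number of data entries under this node

def find_extractable_subtree(root):
    """Walk toward the heavier child until a subtree holding between 1/3 and
    2/3 of the total entries is found; fall back to the leaf otherwise."""
    total = root.count
    node = root
    while node.left is not None:          # still at an internal k-d-tree node
        for child in (node.left, node.right):
            if total / 3 <= child.count <= 2 * total / 3:
                return child              # balanced enough to extract
        # otherwise descend into the child holding more than two-thirds
        node = node.left if node.left.count > node.right.count else node.right
    return node                           # leaf reached: use a corner split
```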

4.2 Hashing Methods

4.2.1 Grid File


Each of the multi-key file structures in use has its strengths and weaknesses, and environments where it is well suited. There is a need for a file structure that provides a balance among the performance criteria. The Grid File [Sevc94] is designed to handle efficiently a collection of records with a modest number of search attributes whose domains are large and linearly ordered. It combines several desirable properties of file structures: high data storage utilization, smooth adaptation to the contents stored, fast access to individual records and efficient processing of range queries. It is an example of the technique where the embedding space, from which the data is drawn, is organized. The goal of the grid file is to retrieve records with at most two disk accesses and to handle range queries efficiently. This is done using a grid directory consisting of grid blocks that are analogous to the cells of the fixed grid method. All records in one grid block are stored in the same bucket, and several grid blocks can share a bucket. Since the directory may grow large, it is usually kept on secondary storage. To guarantee that data items are always found with no more than two disk accesses, the grid itself is kept in main memory, represented by d one-dimensional arrays called scales. The goals of grid files are:
- The two-disk-access principle for point queries.
- Efficient processing of range queries in large linearly ordered domains.
- Splitting and merging of grid blocks involving only two buckets.
- Maintaining a reasonable lower bound on bucket occupancy.


The first three points determine processing efficiency and the last one concerns memory utilization. The reasons behind the design of grid files are:
- Range queries demand grid partitions of the search space.
- Efficient update after modification of a grid partition demands convex assignment of grid blocks to data buckets.
- The two-disk-access principle demands representation of an assignment by means of a grid directory.

4.2.1.1 Grid File Vs Other Multi-key Access Techniques

Several criteria like operation speed, space utilization and adaptability are important in assessing multi-key file structures. Inverted files require excessive disk accesses for retrieval of inverted lists, and the overhead required for insertions and deletions in inverted files becomes prohibitive in terms of space and time. A multi-key hash file has low average bucket occupancy or a high likelihood of bucket overflows, and multi-key hashing is inappropriate when the selection condition involves a range of values. A transposed file is most effective when the majority of operations deal with a significant portion of the records and the selection condition involves few attributes.

Figure 12: Grid File

Figure 12 shows the Grid File structure for the running example, a two-dimensional data space. The capacity of a block is set to four objects. The central block is
the root directory with scales along the X- and Y-axes. The objects shown in the root node represent the actual objects and have entries for them. There is a one-to-one correspondence between the directory entries and the blocks. For example, there would be three entries pointing to a, d and A in the root node. This leads to the main problem of this structure: the exponential growth of the directory as the dimensionality of the objects increases. Efficient algorithms are needed for proper storage utilization.

4.2.1.2 Salient Features

A grid directory is a data structure that supports the operations needed to update convex assignments during bucket overflow or underflow. Its purpose is to maintain the dynamic correspondence between the grid blocks in the record space and the data buckets. The two-disk-access principle implies that all the records in one grid block must be stored in the same bucket; several grid blocks can share a bucket. The directory has a dynamic d-dimensional array called the grid array and d one-dimensional arrays called linear arrays. Only the indices of the intervals above the point of splitting or merging are shifted. A bucket is used as the storage unit of records. The grid partition assumes independent attributes. Each region boundary divides the search space into two, and all dimensions are treated symmetrically. A grid partition is modified by altering one component at a time. Access paths to separate buckets are disjoint, hence concurrency control protocols are simpler. The grid file was designed to handle large amounts of data.
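A minimal sketch of the two-disk-access point lookup, assuming in-memory scales (one sorted list of interval boundaries per dimension) and a dictionary standing in for the on-disk grid directory. All names and the example partition are illustrative:

```python
from bisect import bisect_right

def cell_of(point, scales):
    """Map a d-dimensional point to grid-directory indices via the scales."""
    return tuple(bisect_right(scale, coord)
                 for coord, scale in zip(point, scales))

# Illustrative 2-d partition: x split at 50, y split at 30 and 60.
scales = [[50], [30, 60]]
directory = {            # (x-index, y-index) -> bucket address
    (0, 0): "bucket_A", (0, 1): "bucket_A",   # grid blocks may share a bucket
    (0, 2): "bucket_B",
    (1, 0): "bucket_C", (1, 1): "bucket_D", (1, 2): "bucket_D",
}

# First (logical) disk access reads the directory cell; the second reads
# the bucket it points to.
print(directory[cell_of((20, 40), scales)])   # -> bucket_A
```

The scales stay in main memory, so a fully specified query costs at most one access for the directory element and one for the bucket itself.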

4.2.1.3 Operations

Operations on the grid directory consist of direct access, next in each direction, merge and split. A file is created by repeated insertions. When deletions occur, merging operations are triggered. Repeated splitting of the set of buckets currently in use can be represented as a binary tree: a twin system where each bucket and its region have a unique twin from which it split off. The essential decisions to be made are the grid partition, the assignment of grid blocks to buckets, and the grid directory. Based on these decisions, the important issues to be considered are: choice of splitting policy, choice of merging policy,
grid directory implementation, and concurrent access.

Search: For fully specified queries two disk accesses are needed: the linear arrays are scanned in memory to determine the interval indices, which give direct access to the correct element of the grid directory where the bucket address is located; the bucket itself is then read. For range queries, after eliminating duplicates, the pages are fetched into main memory for detailed inspection. The following measure plays a major role for accuracy during information retrieval: (number of records retrieved that meet the query specification) / (total number of records retrieved).

Search(L,Q): Input: Grid File with linear array L and query region Q. Output: All objects in the query region.
Step 1: For each attribute range, the corresponding intervals which overlap with the range are obtained from the linear array L for that attribute.
Step 2: For each entry, if the directory page is not in main memory, one disk access is necessary.
Step 3: Another disk access is required to retrieve the data objects pointed to by each cell selected.

Insertion: For inserting an object into a grid file, the linear array is searched to determine the interval where the new object belongs; the object is then inserted at that location. If there is an overflow, the split algorithm is invoked.

Insert(L,O): Input: Linear array of Grid File L and object to be inserted O. Output: New Grid File with the object inserted.
Step 1: An exact match query Search(L,O) is performed to locate the interval and cell where the object should be inserted.
Step 2: If the cell has sufficient space, the new object is inserted.
Step 3: If there is not sufficient space, splitting is done.
Step 4: If a split was performed, the linear array is updated with the new intervals.

Split Policy: The overflow of a bucket triggers a refinement of the grid partition. Several splitting policies are compatible with the grid file, and the simplest one is to choose the
dimension according to a fixed schedule, such as a cyclic one. The splitting policy may favor some attributes more than others; this increases the precision of answers to partially specified queries involving the favored attributes. The location of the split need not be chosen at the midpoint; it may also be taken from a set of values convenient for a given application.

Deletion: Deletion is not a local operation. With the deletion of an entry, the storage utilization of the corresponding data page may drop below a threshold, and depending on the current partitioning of space, merging may be required. This check requires a complete scan of the grid directory.

Delete(L,O): Input: Linear array L of the grid file and object to be deleted O. Output: Grid File with O deleted.
Step 1: An exact match query Search(L,O) is performed to locate the cell which contains the object O.
Step 2: O is removed from the cell.
Step 3: In case of under-utilization, the merging policy is invoked.
Step 4: If merging occurred due to under-utilization, the linear array is updated accordingly to remove the interval.

Merge Policy: Merging occurs at two levels: bucket merging and merging of two cross sections in the grid directory. Directory merging occurs rarely, whereas bucket merging is an integral part of the grid file. The merging policy is controlled by three decisions: selecting pairs of adjacent buckets to be merged, giving priority among the selected pairs, and the merging threshold that determines the bucket occupancy at which merging is triggered. For selection of the pairs to be merged, two systems can be used: the buddy system and the neighbor system. In the buddy system a bucket can merge with exactly one adjacent buddy in each dimension. The assignment of grid blocks to buckets is such that buddies can always merge if the total number of objects fits into one bucket; this system can be used as the standard merging policy. In the neighbor system, a bucket can merge with either of its two adjacent neighbors in each dimension.
Both systems give reasonable performance. Based on the merging policy, priorities can be given to a pair for merging. A particular axis can be favored when the granularity of the partitions along the different axes differs.
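The buddy rule along a single dimension can be sketched as follows; the interval arithmetic assumes a domain partitioned by successive halving, and the function name and domain are illustrative:

```python
def buddy_interval(lo, hi, domain=(0, 256)):
    """Return the unique buddy of the half-open interval [lo, hi) obtained by
    successive halving of the domain; merging the pair restores the parent."""
    width = hi - lo
    index = (lo - domain[0]) // width   # position among same-size siblings
    if index % 2 == 0:
        return (hi, hi + width)         # even position: right-hand buddy
    return (lo - width, lo)             # odd position: left-hand buddy

print(buddy_interval(64, 128))   # -> (0, 64): together they form [0, 128)
```

The neighbor system would instead allow merging with either adjacent interval of the same size, which gives more merge candidates but can produce assignments that later block merges.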


The merging threshold p is selected so that after merging two buckets, the occupancy of the resultant bucket is at most p percent. A grid directory behaves like a k-dimensional array for operations like direct access and next. A refinement in the grid partition causes changes in the directory structure only if the shortest interval is split. The grid file uses space economically over a wide range of operating conditions. The number of buckets grows in proportion to the number of records. Attribute correlations affect the size of the directory, but do not affect the average bucket occupancy. Performance aspects to be considered while using grid files are:
- Growth of the file and directory under repeated insertions.
- The steady-state file, since in the long run the number of records in a file remains approximately constant; this tests the interaction between the splitting and merging policies.
- Shrinking of the file due to repeated deletions.

4.2.2 Buddy-Tree
Multidimensional access methods are necessary for the efficient storage and retrieval of geometric and multimedia data. These methods should satisfy the following properties: they must be dynamic, supporting arbitrary insertions and deletions of objects without any global reorganization and without loss of performance, and they must efficiently support a large set of queries.

The Buddy-Tree [Seeg90] is a dynamic hashing scheme with a tree-structured directory. It was created to support point as well as spatial data in a dynamic environment. It can be seen as a combination of the R-Tree [Gutt84] and the Grid File [Sevc94], but is fundamentally different from each of them. It avoids the downward splitting seen in the K-D-B-Tree [Robi81] and the overlapping of the R-Tree. It generalizes the buddy system of the Grid File to organize correlated data efficiently. This is done by bounding the data objects tightly using the minimum-bounding-rectangle concept of the R-Tree and organizing the directory as in the R-Tree. As in Grid Files, data of non-zero size are mapped into higher dimensions. The partition concept of the Buddy-Tree prevents splits from propagating downwards. The tree is constructed by consecutive insertion, cutting the universe recursively into two parts of equal size with iso-oriented hyperplanes. The basic principle of this point access method is to partition the data space into regions such that all objects of a data page are taken from one region. Indirect splits are completely avoided. The performance of the Buddy-Tree is independent of the sequence in which the data is inserted, and it performs well for highly correlated data. It organizes data using a tree-based directory where each axis is treated equally. Searches can be performed quickly as dead space is avoided, and there exists only one path for an exact match
query. It has an increased fanout. Due to its high number of candidates for the merge operation, the buddy tree has efficient dynamic behavior.

B-Rectangles: A rectangle R is called a B-Rectangle of S iff R can be generated by successive halving of S. B-Rectangles have the following properties:
- If R and S are B-Rectangles of data space D and R is a subset of S, then R is a B-Rectangle of S.
- Let R be a B-Rectangle of S. If S is doubled in any arbitrary direction to give S′, then R is also a B-Rectangle of S′.
- The B-Region of R, B(R), is the smallest rectangle B such that R is a subset of B and B is a B-Rectangle of D.

A set of d-dimensional rectangles is called a B-Partition of data space D if no two B-Regions of the rectangles in the set overlap with each other. Any two rectangles in a B-Partition are called buddies if their union does not overlap with any of the other rectangles in the B-Partition. A B-Partition is called regular if all the rectangles in it can be represented in a kd-trie. In the kd-trie, the leaf nodes represent the rectangles of the B-Partition and the internal nodes represent B-Rectangles; each internal node consists of an axis and two pointers referring to sub-trees.
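The successive-halving definition of a B-Rectangle can be illustrated in one dimension; this recursive membership check is a sketch of the definition, not the paper's algorithm:

```python
def is_b_rectangle(r, s):
    """1-d sketch: interval r = (lo, hi) is a B-interval of s iff it can be
    produced from s by successive halving."""
    if r == s:
        return True
    mid = (s[0] + s[1]) / 2
    if r[1] <= mid:                        # r lies in the lower half
        return is_b_rectangle(r, (s[0], mid))
    if r[0] >= mid:                        # r lies in the upper half
        return is_b_rectangle(r, (mid, s[1]))
    return False   # r straddles the midpoint: not obtainable by halving
```

For example, (0, 32) is a B-interval of (0, 128), while (16, 48) is not, because it straddles the first halving boundary at 64's sub-boundary 32.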

4.2.2.1.1 Buddy-Tree Vs R-Tree and Grid files


Grid files lose performance for highly correlated data. The Buddy-Tree is designed to organize such data efficiently, partitioning only those parts of the data space which contain data and leaving empty space unpartitioned. Its performance does not depend on the order of insertion of objects, and the directory grows linearly in the number of records as the dimension increases. To avoid the deadlock problem of the grid file, the Buddy-Tree uses k-d-tries to partition the universe. The number of possible buddies is larger than in grid files and other structures, which makes the Buddy-Tree more adaptive under updates. It differs from the R-Tree and its variants by avoiding overlap between node entries. The K-D-B-Tree and the R+-Tree [Sell87] had cascaded splitting as their main drawback; the buddy-tree avoids this by allowing only a special class of B-Partitions called regular B-Partitions. The buddy-tree avoids nodes with one entry and is an unbalanced tree, but it guarantees linear growth of the number of nodes in the number of data items.

4.2.2.1.2 Buddy-Tree Vs Other Structures: Performance Comparison


The Buddy-tree outperforms the hB-Tree [Lome90] and the Grid File for all data distributions which have one of the following characteristics: 1. The density of population varies over space, with densely populated and unpopulated regions. 2. The data is inserted in sorted order.


For uniform and bit distributions, though it does not have the best search time, it competes with the hB-Tree and Grid File. Even for its worst-case distribution the Buddy-Tree performs better than the Grid File.

Figure 13: Buddy-Tree

The Buddy-Tree structure for the two-dimensional running example is illustrated in Figure 13. The maximum number of entries that can be stored in a page is set to 4. The space around the minimum bounding rectangles of at most 4 data objects is dead space. The Buddy-Tree is of height two. At the top level the bounding rectangles contain the bounding rectangles of the next lower level. At level one, the B-Rectangles cover the objects in their regions. The points X1 to X3 and Y1 to Y2 represent the locations of the partitions along the X and Y directions respectively.

4.2.2.2 Salient Features

The directory grows linearly in the number of objects. No overflow of nodes is allowed. The data space is partitioned into minimum bounding rectangles of the actual data; empty data space is not partitioned. Performance is independent of the sequence of insertions. Each node contains at least two entries; because of this the buddy tree may not be balanced, but its growth is linear.


Each entry is of the form (MBR, Child_Pointer), where MBR is a d-dimensional rectangle and Child_Pointer refers to a sub-tree containing all the rectangles in its MBR. Each interior node corresponds to a d-dimensional partition of an interval. Partitions that correspond to the same tree level are mutually disjoint. The leaf nodes point to the data pages. Whenever a node is split, the minimum bounding boxes (MBBs) of the resulting sub-nodes are recomputed to reflect the current situation; this achieves high selectivity at the node level. Except for the root, there is exactly one pointer referring to each node, which ensures that the growth of the tree is linear. The union of all regions does not cover the complete data space. As overlap is avoided, insertions, deletions and exact match queries are restricted to one path of the tree. It has a high fanout for the intermediate nodes; thus the height of the tree, and hence the retrieval costs, are reduced. Deadlocks cannot occur in a buddy-tree as it does not cover dead space.
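Recomputing a minimum bounding box after a split reduces to taking per-dimension minima and maxima over the child rectangles. A small sketch with an illustrative rectangle encoding (a tuple of (lo, hi) pairs, one per dimension):

```python
def mbr(rects):
    """Minimum bounding rectangle of a list of child rectangles; each
    rectangle is a tuple of (lo, hi) pairs, one pair per dimension."""
    dims = range(len(rects[0]))
    return tuple((min(r[d][0] for r in rects),   # tightest lower bound
                  max(r[d][1] for r in rects))   # tightest upper bound
                 for d in dims)
```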

4.2.2.3 Operations

The search algorithm retrieves all the data that overlap with the query region. An exact match query is restricted to one path of the tree. A range query is similar to an exact match query, except that there can be a set of answers. Given the Buddy-Tree and query region Q, the tree is traversed, and at each node the position of the first rectangle in the node intersecting the query region Q is obtained. If there is no such rectangle, the position is zero. If the node is an intermediate node, the rectangle at that position is searched recursively; if the node is a leaf node, the entry in the node is returned. The position of the next rectangle after the current position which intersects the query region is then obtained, and the process is repeated until the position becomes zero.

Search(N,Q): Input: Buddy Tree rooted at N and query region Q. Output: All objects in Q are retrieved.
Step 1: Let P be the position of the first object or rectangle in N which intersects Q. If there is no such entry, P is set to zero.
Step 2: While P is not zero, Steps 3 to 5 are followed.
Step 3: If N is an intermediate node and E is the child node of N at position P, Search(E,Q) is invoked.
Step 4: If N is a leaf node, the objects at the given position P are retrieved from N.
Step 5: P is set to the position of the next object or rectangle which intersects the query region Q.

Insertion: The insertion algorithm searches for the node where the data has to be inserted and inserts the data; in case of overflow it splits the node and propagates the split. For insertion, an exact match query first determines the leaf node where the new data has to be inserted. If this search ends in an intermediate node, a new entry is created whose rectangle is described by the data to be inserted. Then a buddy with which the new rectangle can be merged is searched for, and the data is inserted into the node. If no buddy can be found, a new node is created and the record is inserted. The insertion is restricted to one path of the tree. The difficult case for an insertion is when the query ends in an intermediate node: a new non-leaf node entry is created where the object has to be inserted, and this node is inserted into the Buddy-Tree at a suitable location.

Insert(N,O): Input: Buddy Tree rooted at N and object to be inserted O. Output: Reorganized Buddy Tree.
Step 1: An exact match query Search(N,O) is performed to find the node N where the new data has to be inserted.
Step 2: If N is a non-leaf node, a new entry E is created for O to be inserted.
Step 3: E is added to N.
Step 4: A search is performed to find another entry in the node with which the new entry can be merged. If such an entry exists, the two entries are merged and the node is updated.
Step 5: If N overflows, Split(N) is invoked and the new nodes are updated.
Step 6: If O is not already in the tree, it is inserted into N. In case of overflow, Split(N) is invoked.

Node Split: The split algorithm is the most complicated algorithm of the Buddy-Tree. As in B-Tree algorithms, a split can propagate up the tree, but it cannot leave the top-down search path. If the distribution of the entries is uneven, a buddy that can be merged with the under-filled node is determined.
The Split algorithm first determines the parent node of the node to be split and the position of the node in its parent. Then the axis along which the split is to be performed is determined. In case of ties, the axis which has the minimum sum of margins of rectangles is selected. The entries are divided into two groups corresponding to a hyperplane perpendicular to the split axis. One group of entries remains in the old node and the other group is stored in the new node. The original node is updated. A new entry is inserted in
the parent node. If there is an overflow in the parent node, it is split in turn, and all the nodes on the path are updated. The step of determining the parent node has the following exceptions: if the node is the root, a new root is created and filled with one entry referring to the split node; if it is a leaf node which is not at the deepest level of the buddy tree, then a new parent node is created with one entry referring to the original split node, and the level of the node is incremented.

The update algorithm first finds the parent of the current node. If the node is the root node, the algorithm stops. Otherwise the entry of the node in its parent is determined. If the bounding box of the entry in the parent node differs from the actual bounding box of the node, the entry in the parent node is updated and the steps are repeated for the parent node.

Split(N): Input: Node to be split N. Output: Reorganized Buddy Tree.
Step 1: Let P be the parent node of N.
Step 2: The split axis along which the node is to be split is obtained.
Step 3: A new node N′ is created and the entries are distributed among N and N′.
Step 4: The nodes are updated.
Step 5: An entry for N′ is inserted into P.
Step 6: If P overflows, then Split(P) is invoked.
Step 7: If P is not the root node, let PP be the parent node of P.
Step 8: P's entry in PP is updated.
Step 9: Let P = PP. The process is repeated from Step 7 on node P.

Deletion: Deletion is similar to the insertion algorithm and follows only one path from the root. For deletion, an exact match query determines the node which holds the data to be deleted. The entry is removed and the nodes on the path are updated.

Delete(N,O): Input: The root node of Buddy Tree N and object to be deleted O.
Output: Reorganized Buddy Tree.
Step 1: Using an exact match query Search(N,O), the node which has the entry of O is determined.
Step 2: O is removed from N.
Step 3: If N is not the root node, let P be the parent node of N.
Step 4: N's entry in P is updated.
Step 5: Let N = P. The process is repeated from Step 3 on N.

The Buddy-Tree is a combination of the R-Tree and the grid file, but different from each of them. It generates rectangular regions that are as minimal as possible, so the data space is not completely covered by these regions, and it avoids overlap between nodes. As it uses a generalized buddy system, its performance is independent of the order of insertion of the data. It has shown superiority and robustness over other index structures such as grid files, the hB-Tree and the R-Tree.
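The range search over non-overlapping node entries described in this section can be sketched recursively; the tuple-based node layout is an illustrative assumption, not the paper's page format:

```python
def intersects(r, q):
    """Axis-aligned rectangles r, q given as (lo, hi) pairs per dimension."""
    return all(rl <= qh and ql <= rh
               for (rl, rh), (ql, qh) in zip(r, q))

def search(node, q, out):
    """Collect payloads of entries intersecting q.
    node = ('leaf', [(rect, obj), ...]) or ('inner', [(rect, child), ...])."""
    kind, entries = node
    for rect, payload in entries:
        if intersects(rect, q):
            if kind == 'leaf':
                out.append(payload)      # data entry qualifies
            else:
                search(payload, q, out)  # descend into the child node
```

Because entry rectangles at one level are disjoint, an exact match query (a degenerate query box) follows exactly one path, while a range query may descend into several children.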

4.3 Space Filling Curves


4.3.1 UB-Tree
The Universal B-Tree or UB-Tree [Bayer97] is based on the classical B-Tree. It is easy to implement because its operations are based on those of the B-Tree. The UB-Tree provides a good solution for the following design goals of multidimensional access for efficient query processing:
- Data belonging together must be retrievable with minimum effort and time, by clustering it together.
- Efficient organization for extensions of the database.
- The time and memory effort for insertion, deletion and queries should be computable.
- Optimal memory utilization.
The UB-Tree organizes data in an n-dimensional space such that it can be stored and managed on, and retrieved and deleted from, peripheral storage very efficiently. It preserves the clustering of objects with respect to their Cartesian distance. Its basic idea is to use a space-filling curve [Saga94, Bial69]; such curves map a multidimensional space into a one-dimensional space while preserving multidimensional clustering. Since the performance guarantees for processing time are logarithmic in the number of objects in the data space, this method is suitable and robust for very large applications and scales very well to large problems. Real multidimensional clustering is achieved by partitioning the data space into many subspaces and assigning combined subspaces to physical disk pages. The subspaces are numbered using a space-filling curve, so records that are spatially adjacent reside in the same pages on disk and can therefore be read with only one page access.


4.3.1.1 UB-Tree Vs Other Multi-Dimensional Access Methods

The main drawback of the B-Tree is that it works well only for one-dimensional data. Index structures like K-D-B-Trees, R*-Trees and grid files do not satisfy all of the above goals; for example, the effort of adding a new attribute is much higher for these index structures. They also have complex algorithms for basic operations like insertion, deletion and queries, and implementing these methods is very expensive for database applications. For queries, the UB-Tree has multiplicative complexity, which results in performance improvements over secondary indexes.

4.3.1.2 Salient Features

The UB-Tree uses the Z-curve to partition a multidimensional space disjunctively. Z-values are calculated by bit interleaving, and the data are clustered according to their ordinal number on the Z-curve. The space covered by an interval on the Z-curve is called a Z-region, written [α : β]; a Z-region has no more than two disconnected parts, so that Z-regions can be used for multidimensional queries and clustering. One can switch arbitrarily between the linear Z-space and its geometric interpretation, and the Z-region partitioning largely preserves spatial proximity. Important characteristics of UB-Trees:
- The UB-Tree uses the structure of the B-Tree to store the addresses of Z-regions.
- It has high potential for parallel processing.
- The dimensionality can be increased to include additional attributes.
- It preserves clustering with respect to Cartesian distance.
- It can be implemented on top of any database system, or on an index structure, by a preprocessing technique.
- For updates only the UB-Tree has to be managed instead of several secondary indexes. It is useful for databases where multiple secondary indexes are widespread; these can be replaced by a single UB-Tree index.
- It maps a multidimensional space into a linearly ordered space.
- The number of pages retrieved is related to the result set's size, which leads to very desirable response-time behavior. With larger database sizes the Z-region partitioning gets finer and query boxes can be better approximated by the partitioning. If the query box is smaller than a Z-region, then only a few regions must be retrieved.
- It clusters data symmetrically and respects all dimensions. The more dimensions the query restricts, the more accurate the query gets with a multidimensional index.
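The bit interleaving that produces a Z-address can be sketched as follows (MSB-first interleaving; the fixed bit width and function name are illustrative):

```python
def z_address(coords, bits=8):
    """Bit-interleave the coordinates of a point into a Z-value (Morton code),
    taking one bit per dimension per round, most significant bit first."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> bit) & 1)
    return z

print(z_address((0b0011, 0b0101), bits=4))   # -> 27, i.e. 0b00011011
```

Points whose coordinates share high-order bits receive nearby Z-addresses, which is what makes the resulting one-dimensional order cluster spatially adjacent records.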


Figure 14a: UB-Tree

Figure 14b: UB-Tree Structure

The UB-Tree structure for the running example is shown in Figure 14a. The first n-1 nodes of the UB-Tree are used to store the addresses of the Z-regions; these nodes are called the UB-Index. The nth node, called the UB-File, stores the data objects. The Z-curve starts at the top-left corner and terminates at the bottom-right, reaching all other points of the space between these two. The advantage of this curve is that a two-dimensional space is mapped onto a one-dimensional space, the Z-curve; the B-Tree structure is then used to manage the objects. The Tetris algorithm [Mart99] offers another way of ordering the space, but the Z-region partitioning is the most effective way of partitioning multidimensional spaces.


4.3.1.3 Operations

The UB-Tree requires linear space for storing data and logarithmic time for the basic operations Search, Insert and Delete. It guarantees bounds on space utilization and regains unused space on disk storage.

Search: Regions that intersect a given query region have to be retrieved. First the initial Z-region that overlaps the query region is retrieved, then the next Z-region covering the query range, and so on until the whole query range is minimally covered and the region containing the last point has been retrieved. Only the Z-regions that actually contribute to the result are retrieved, and attributes are restricted in several dimensions; this reduces the number of disk accesses and the memory effort. The factors affecting range queries are:
- The position and volume of the query box.
- The extension of the query box in each dimension.
- The data distribution and the split parameters of the Z-regions.
- The number of dimensions.

The number of pages retrieved is correlated with the result set's size, which results in desirable response-time behavior. With larger database sizes the partitioning gets finer and query boxes can be better approximated. For range queries, UB-Trees retrieve less unwanted data, have faster insertion and need less storage than B-Trees. The search algorithm first computes the Z-addresses of the starting point a and the finishing point b. The area covered by the query box is empty at the beginning. Then the Z-regions that cover the query region are recovered one by one. First the Z-region is found that contains the first point of the query range according to the ordinal numbers on the Z-curve, and the tuples in it that fall into the query box are chosen. Next the first point after that Z-region that overlaps with the query region is computed. The loop is performed until all the qualifying objects have been returned. UB-Tree range queries restrict attributes in several dimensions, decreasing the number of disk accesses and the memory effort. Let N be the total number of objects in the database, m = M/2 and r the number of regions intersecting the query. The time complexity of a point query is O(log_m N) and that of a range query is r * O(log_m N).

Search(N,Q): Input: UB-Tree rooted at N and a, b: the tuples that define the query region Q. Output: Result set R of the range query.


Step 1: Let x = Z(a) and y = Z(b) be the Z-addresses of the starting and ending points of the query region Q, and let the result set R = NULL. Step 2: The Z-region [α:β] such that α ≤ x ≤ β is searched in the UB-Tree. Step 3: For each entry E in [α:β], R = R ∪ {E} is done if E ∈ [a, b]. Step 4: Let x = the Z-address of the first point overlapping the query range such that x > β. Step 5: Steps 2 to 4 are repeated until x > y. Step 6: The resultant set R is returned. Insertion: An object to be inserted into an existing UB-Tree is characterized by its attributes. According to its corresponding Z-address the object belongs to a Z-region. The object is inserted into the leaf node corresponding to that region, which is found by an exact-match query. An object that intersects several regions is inserted into each region it properly intersects. When this causes an overflow, the region is split by introducing a new Z-area. Data is added until a node reaches its capacity. The insertion algorithm determines the Z-address of the object to be inserted, locates the leaf node by an exact-match query and inserts the object there. If the node overflows, it is split. Inserting an object that lies inside one region takes O(logm N) time; for objects covering more than one region it takes r * O(logm N), where r is the number of regions containing the object. Insert(N,O) Input: UB-Tree rooted at N and the object to be inserted O. Output: Reorganized UB-Tree. Step 1: x = Z(O). Step 2: Search(N,O) is invoked to get the node having Z-region [α:β] such that α ≤ x ≤ β. Step 3: O is added to that node. Step 4: If the node overflows, Split is invoked. Step 5: If there was a split, the tree is updated.


Since pages can store a maximum of M objects, they may overflow, and a split is done as in B-Trees. Let ε be a performance parameter. Split(N) Input: Node to be split N with Z-region [α:β] and a performance parameter ε. Output: Nodes N′ and N″ created from N. Step 1: An address γ is selected in the Z-region [α:β] so that the number of entries inside the region [α:γ] is between (M/2 − ε) and (M/2 + ε). Step 2: The node is split into two regions, [α:γ] and [γ+1:β]. Step 3: The objects in the original node are distributed over the two new nodes. Deletion: When objects are deleted from a region, their corresponding entries are removed from its Z-region. If this results in an underflow, i.e. fewer than (M/2 − ε) elements, the node is merged with the following node: the Z-regions corresponding to the previous regions disappear and a new Z-region is created. If there is an overflow in the resulting node, it is split using the splitting algorithm. The effect is that data is stored equally distributed, with a worst-case storage utilization of 50%. This underflow technique is analogous to that of the B-Tree. To delete an object, its Z-address is computed, the Z-region containing the object is determined by an exact-match query and the object is removed from the region. If there is an underflow after deletion of the entry, the region is merged with the following Z-region; if the resultant node overflows, it is split. Deleting an object inside one region takes O(logm N); for objects covering more than one region it takes r * O(logm N), where r is the number of regions intersecting the object. Delete(N,O) Input: UB-Tree rooted at N, the object to be deleted O and the performance parameter ε. Output: Reorganized UB-Tree. Step 1: x = Z(O). Step 2: Search(N,O) is invoked to get the node having Z-region [α:β] such that α ≤ x ≤ β. Step 3: O is removed from [α:β]. Step 4: If the node underflows, i.e. region [α:β] has fewer than (M/2 − ε) entries, region [α:β] is merged with its neighboring region [β+1:γ] to obtain [α:γ].


Step 5: If region [α:γ] has more than M entries, Split is invoked. The UB-Tree is an effective and cheap solution for multidimensional queries. Its implementation is neither too expensive nor too complex. It requires linear space for storage and logarithmic time for the basic operations search, insertion and deletion. As it inherits its properties from the B-Tree, it has advantages over other multidimensional access methods. It is useful for geographic databases, data-warehousing and data-mining applications, and since it preserves clustering it shows its main strength for multidimensional data.
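The core idea, a B-Tree keyed on Z-addresses, can be illustrated with a toy in-memory stand-in; `bisect` over a sorted list plays the role of the B-Tree pages, and all names and data below are illustrative:

```python
import bisect

def z_address(x, y, bits=4):
    # Interleave the bits of x and y (2-D Morton code).
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Emulate the UB-tree with a list sorted by Z-address; a real UB-tree
# keeps these one-dimensional keys in B-tree pages (Z-regions) on disk.
points = [(3, 5), (1, 1), (6, 2), (4, 4)]
index = sorted((z_address(x, y), (x, y)) for x, y in points)
keys = [k for k, _ in index]

def point_query(x, y):
    """Exact-match query: an ordinary ordered-key lookup on the Z-codes."""
    i = bisect.bisect_left(keys, z_address(x, y))
    return i < len(keys) and index[i][1] == (x, y)
```

A point query thus reduces to a one-dimensional search, which is why the exact-match cost above is O(logm N), exactly as in a B-Tree.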

4.4 Distance-based Indexing Methods 4.4.1 MVP-Tree


Similarity between images can be measured using features such as shape, color and texture. Indexing large metric spaces can be done in different ways, and many database applications use such indexes for approximate-match queries. Distance transformation to Euclidean spaces works for many applications, but it assumes that such a transformation exists; it is not always possible or cost-effective to employ one. Distance-based structures use different approaches for finding the best matches to a given query. The M-Tree [Ciac97] is one of the first structures for metric spaces that supports dynamic insertions and deletions. In the Vantage-Point tree or VP-Tree [Uhlm91, Yian93], at every node of the tree a vantage point is chosen among the data points and the distances of this vantage point from all other points are computed. These points are sorted into an ordered list with respect to their distances from the vantage point, and the list is partitioned into sublists of equal cardinality; the order of the tree corresponds to the number of partitions made. In the VP-Tree, since the vantage point of a node has to be chosen among the data points indexed below that node, the vantage points of siblings are all different. During the construction of the VP-Tree, for each data point in the leaves the distances between that point and all the vantage points along the path are computed, but these distances are not stored in the tree. When constructing a VP-Tree, selecting vantage points from the corners of the space leads to better partitions and better performance. The Multi-Vantage-Point Tree or MVP-Tree [Bozk99] is similar to the VP-Tree in that both use relative distances from a vantage point to partition the domain space, but the MVP-Tree uses more than one vantage point at each level and partitions the data space into spherical cuts around the vantage points.
A two-level MVP-Tree that keeps all vantage points in a single directory is an efficient structure for minimizing the number of distance computations in similarity queries. The following heuristics are used in designing the MVP-Tree: The same vantage point can be used to partition the regions associated with nodes at the same level.


The distances of the data points from the other vantage points can be stored in the leaf nodes, which provides further filtering at the leaf level during search operations.

For choosing vantage points in MVP-Trees, the following heuristic gives better performance than selecting vantage points randomly. It uses the observation that the farthest point from any given point is most likely to be close to a corner of the space: a random point is selected, the distances from this point to all other points are computed, and the farthest point is chosen as the vantage point.
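This farthest-point heuristic can be sketched as follows; the function name and the 1-D absolute-difference metric in the example are illustrative assumptions:

```python
import random

def choose_vantage_point(points, dist, seed=None):
    """Farthest-point heuristic for picking a vantage point.

    A random probe point is sampled, distances from it to all points are
    computed, and the farthest point (likely near a 'corner' of the
    space) is returned as the vantage point.
    """
    rng = random.Random(seed)
    probe = rng.choice(points)
    return max(points, key=lambda p: dist(probe, p))

# 1-D example: whichever probe is drawn, the farthest point from it is
# one of the two extremes of the set.
pts = [1, 2, 5, 9, 40]
vp = choose_vantage_point(pts, lambda a, b: abs(a - b), seed=0)
```

The heuristic costs one linear scan of distance computations per vantage point, which is cheap relative to the distance computations it later saves during search.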

4.4.1.1.1 MVP-Tree Vs VP-Tree


The MVP-Tree uses more than one vantage point at each level of the tree to increase the fan-out of each node. MVP-Trees store the computed distances of the data points from the vantage points, and these distances are used for efficient filtering at search time. For higher dimensions, the VP-Tree has more branching, which hurts search performance. The MVP-Tree uses the same vantage point for all the nodes at the same level, and the data points in the leaf nodes store their distances from a certain number of vantage points along the path. The MVP-Tree performs better than the VP-Tree. Increasing the leaf capacity decreases the number of vantage points by shortening the height of the tree and delays the major filtering step to the leaf level. The MVP-Tree shows a significant efficiency gain over VP-Trees in terms of the number of distance computations for larger query ranges.

4.4.1.1.2 MVP-Tree Vs M-Tree


The M-Tree is a dynamic index structure created in a bottom-up fashion. M-Trees can handle dynamic insertions and deletions independent of the data distribution. The M-Tree has two parameters, the minimum node utilization and the page size, for tuning its performance. The performance of the MVP-Tree degenerates faster than that of the M-Tree as the query range is increased. For smaller queries, M-Trees have more overhead and perform less efficiently than MVP-Trees. The partitioning strategy of the M-Tree differs from that of the MVP-Tree. In terms of the number of distance computations, the M-Tree performs better for data sets with physical clusters; in these cases the performance of the MVP-Tree is comparable with that of the M-Tree.


The spherical partitioning of the two-dimensional running example is illustrated in Figure 15a. The data objects are labeled with letters. The parameter values are: the number of partitions from each vantage point (m) is 2 and the number of vantage points at each node (v) is 2. The first vantage point b divides the data space into two groups, and the second vantage point l divides these two partitions into two more sets. At the next level h and H, J and C, F and G, and E and a act as the two vantage points for the four corresponding partitions obtained from the vantage points b and l. At the final step the objects g and D, and f and e, belong to the same partitions.

Figure 15a: MVP-Tree

Figure 15b: MVP-Tree Structure

4.4.1.2

Salient Features

In general the MVP-Tree has the following parameters: the number of partitions created by each vantage point m, the maximum fan-out of the leaf nodes k,


the number of vantage points used in an internal node v, and the number of distances for the data points to be kept at the leaves p.

Some of the salient features of the MVP-Tree are: The MVP-Tree is created from an initial set of data objects in a top-down fashion and hence is a static index structure. The MVP-Tree has v vantage points at every node, and all the children of a given node share the same vantage point. The MVP-Tree has larger fan-outs: the fan-out of an internal node is of the order of m^v. The non-leaf levels have a small number of vantage points. Leaf nodes store only the data points and p pre-computed distances for each data point. If consecutive vantage points are as far apart as possible, the pre-computed distances are utilized better; this is achieved in the MVP-Tree by selecting the next vantage point in an internal node to be one of the points farthest from the first one. Using more pre-computed distance values p improves search performance and also reduces the cost of sorting and partitioning data objects during re-organizations. Using more vantage points and keeping small leaf nodes increases the trimming in the directory nodes.

4.4.1.3

Operations

MVP-Tree Construction: As the MVP-Tree is a static index structure, constructing the tree from the given set of data is important for efficient performance. When selecting vantage points for an internal node, the first vantage point is selected using the heuristics below and is used to partition the space into m spherical-shell-like regions. The same vantage point can be used to partition the regions associated with nodes at the same level, and the distances of the data points from the other vantage points can be stored in the leaf nodes, providing further filtering at the leaf level during search. The farthest point from the first vantage point is chosen as the second vantage point, and the process continues the same way until all v vantage points are chosen and the data space is partitioned into m^v regions. If k is large, the ratio of the number of vantage points to the number of points in the leaf nodes becomes smaller; this makes it possible to filter out many distant points from further consideration by using the p precomputed distances for each point in a leaf node. For choosing vantage points in MVP-Trees the following heuristic gives better performance than selecting vantage points randomly; it uses the observation that the farthest point from any given point is most likely to be close to a corner.


A random point is selected, the distances from this point to all other points are computed, and the farthest point is selected as the vantage point.

In general the MVP-Tree has the following parameters: the number of partitions created by each vantage point m, the maximum fan-out of the leaf nodes k, the number of vantage points used in an internal node v, and the number of distances kept at the leaves for each data point p.

MVP_Construct(S,D,m,v,k,p): Input: Set of n objects S, a metric distance function D and the parameters m, v, k, p of the MVP-Tree. Output: MVP-Tree over the given data. Step 1: If n = 0, an empty tree is created and the algorithm terminates. Step 2: If n ≤ k, a leaf node is created with the objects in S and the algorithm terminates. Step 3: The first vantage point V1 is selected from the set S. Step 4: V1 is deleted from S. Step 5: The distances D(Si,V1) are computed for all Si ∈ S. Step 6: If the current level l is less than p, Si.PATH[l] = D(Si,V1) is recorded for each Si ∈ S. Step 7: The objects in S are sorted with respect to their distances from V1. Step 8: The sorted list is broken into m lists of equal cardinality, recording the distance values at the cutoff points. Step 9: For all v vantage points, Steps 10 to 14 are followed. Step 10: The next vantage point V is selected from S such that it is farthest from the current vantage point. Step 11: V is deleted from its sublist. Step 12: The distances D(Sj,V) are computed for all Sj ∈ S. Step 13: If the current level l is less than p, Sj.PATH[l] = D(Sj,V) is recorded for each Sj ∈ S.


Step 14: The objects are sorted according to the computed distances; the list is partitioned into m regions and the cutoff values are recorded. Step 15: The MVP-Tree is created recursively on each of the m^v partitions. The full MVP-Tree of height h has v * (m^(2h) − 1) / (m^2 − 1) vantage points, which is v times the number of internal nodes of the MVP-Tree. The construction step requires O(n logm n) distance computations. Search: The search algorithm proceeds depth-first for the MVP-Tree. The distances between the query object and the first p vantage points along the search path are kept for filtering data points in the leaves. Starting from the root, all children whose regions intersect the query region are visited. When a leaf node is reached, the distant data points are filtered out by looking at the p pre-computed distances from the vantage points higher up in the tree. Search(N,Q,r): Input: Root node of the MVP-Tree N, the query object Q and the radius r. Output: The set of objects within distance r of Q. Step 1: For each vantage point V, if the distance D(Q,V) ≤ r, then V is in the answer set. Step 2: If N is a leaf node, Steps 3 and 4 are performed for each entry E in the node. Step 3: The precomputed distances D(E,V) for each vantage point V are read from the arrays stored in the node. Step 4: For each vantage point V, if D(Q,V) − r ≤ D(E,V) ≤ D(Q,V) + r, then if each of the p stored entries satisfies PATH[i] − r ≤ E.PATH[i] ≤ PATH[i] + r and D(Q,E) ≤ r, E is in the answer set. Step 5: If N is a non-leaf node, Steps 6 and 7 are performed. Step 6: For each level l less than p, PATH[l] = D(Q,V) is computed for each vantage point. Step 7: For each branch of a vantage point V that intersects the query region, Search(E,Q,r) is performed, where E is the node pointed to by that branch. The efficiency of the search algorithm depends on the distribution of distances among the data objects, the query range and the selection of vantage points.
Even in the worst case the search algorithm performs fewer than n distance computations. For larger queries, having a large number of vantage points at the internal levels of the tree helps trim the search; for smaller query ranges the increase in the number of distance computations results in worse performance than the VP-Tree.
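The leaf-level filtering in Step 4 can be sketched as follows; the entry layout (object, PATH array) and all names are hypothetical, and the exact-distance check that follows filtering is omitted:

```python
def leaf_candidates(entries, query_vp_dists, r):
    """Filter leaf entries using p precomputed vantage-point distances.

    entries:        list of (object, path), where path[i] = D(object, v_i)
    query_vp_dists: list where query_vp_dists[i] = D(query, v_i)
    r:              query radius

    By the triangle inequality, |D(q, v_i) - D(e, v_i)| <= D(q, e); any
    entry that violates the band [D(q,v_i) - r, D(q,v_i) + r] for some i
    therefore cannot lie within distance r of q and is dropped without
    computing D(q, e).
    """
    out = []
    for obj, path in entries:
        if all(abs(q - d) <= r for q, d in zip(query_vp_dists, path)):
            out.append(obj)  # survives filtering; exact distance still needed
    return out

entries = [("A", [2.0, 7.0]), ("B", [4.1, 5.2]), ("C", [9.0, 5.0])]
cands = leaf_candidates(entries, [4.0, 5.0], r=1.0)
```

Only the surviving candidates incur an actual distance computation, which is where the MVP-Tree's savings over the VP-Tree come from.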


Insertion: Although the MVP-Tree is a static index structure, insertions can be done dynamically if the distribution of the inserted data conforms to the distribution of the initial data set, so that the MVP-Tree grows smoothly and stays balanced. If the distributions differ, a global restructuring has to be done, which may require recomputing distances. The number of distance computations required during restructuring depends on the number of pre-computed values p kept in the leaf nodes; if all the pre-computed values are kept, they can be reused during restructuring. Deletion: Like insertions, dynamic deletions can affect the balance of the MVP-Tree and may require a global re-organization. The MVP-Tree is a static distance-based index structure that can be used in any metric data domain. It makes no assumptions about the geometry of the application's data space and provides an efficient filtering method for similarity search, using multiple vantage points to partition the data space together with the distances pre-computed during construction. The MVP-Tree is constructed top-down from the given data set. As it is a static structure, dynamic insertions and deletions require periodic global reorganizations, which have to be cost-efficient; if all the pre-computed distances are kept, the global re-organization can be done with a minimum number of distance computations.

4.4.2 M-Tree
4.4.2 M-Tree
Mapping objects into feature vectors is not possible in all applications. Some similarities can be expressed as a metric distance between objects, and these object distances can be used for query evaluation; in most applications a continuous distance function is used. The Vantage-Point tree or VP-Tree [Uhlm91, Yian93] is a binary tree which uses some pivot element as the root and partitions the remaining data objects into two subsets based on their distance to the pivot element; the same is repeated for the subsets. The basic structure of the M-Tree resembles that of the VP-Tree. It combines the advantages of Spatial Access Methods like the R-Tree [Gutt84] with the capabilities of metric trees to index objects using features and distance functions that do not fit into a vector space. The M-Tree [Ciac97] can process similarity queries by efficiently pruning the search space, and the query response is exact, with no false drops, because the actual distance function is used instead of some approximation. The retrieved objects can also be ranked according to their relevance to the query. The following two concepts are fundamental to the design of the M-Tree: Database objects are recursively organized by considering their distances from reference or routing objects. The reference or routing objects are data objects which acquire their routing roles according to specific promotion algorithms.


For efficient storage organization the following requirements have to be considered: The tree should have fixed-size nodes, to optimize the design of trees residing on external memory devices, a basic requirement for large databases. The tree has to be balanced, with all paths from the root to the leaves having the same length. It has to be dynamic, i.e. it must be able to handle insertions and deletions without degrading search performance or storage utilization, and without costly reorganizations.

4.4.2.1

M-Tree Vs VP-Tree

The M-Tree is designed for secondary memory and allows overlap in the covered areas to make updates easier. The VP-Tree builds its index by a top-down recursive process and hence is not guaranteed to remain balanced under insertions and deletions; it requires costly reorganization to prevent performance degradation. The M-Tree always remains balanced and can be applied wherever the distance function is a metric. Among all metric index structures, the M-Tree is the only one optimized for large secondary-memory-based data sets; all others are main-memory index structures supporting small data sets. No other index structure can solve similarity queries on generic metric spaces and at the same time satisfy the above requirements.

Figure 16a: M-Tree


Figure 16b: M-Tree Structure. It can be seen from the M-Tree structure of the running example shown in Figure 16a that objects B and H are the Routing objects at the top level. At the next intermediate level, objects a, B, G, c, H and J are routing objects. All other data are Ground objects stored in the leaf nodes.

4.4.2.2

Salient Features

An object is called a Ground object if its entry is stored in a leaf node; all other objects are called Routing objects. Objects get inserted as ground objects and may later move up the tree: M-Tree objects are moved up (promoted) or moved down (degraded) within the tree according to the tree-maintenance algorithms. There are two styles of promotion: promotion by partitioning and promotion by voting. Promotion by Voting: A single object is selected from the set for promotion. During degradation of an object from level l to l′, if l > l′+1 then promotion by partitioning results in orphan nodes between levels l and l′; to prevent this inconsistency an entry is promoted from each orphan node to the next higher level. Promotion by Partitioning: Two objects are selected for promotion, and the other objects are assigned to two disjoint subsets depending on their relation to the selected objects. When a node N at level l overflows and must be split, the following sequence of actions takes place: The parent object of N is degraded to level l. A new node N′ is created. Two new routing objects O′ and O″ are selected from the set of entries in N, including the parent of N, and added to N and N′ respectively.


Depending on their relation to O′ and O″, the remaining objects are distributed between N and N′. Objects O′ and O″ are promoted to level l+1 and stored in the common parent node of N and N′.

Depending on the nature of the indexed objects and the information stored in the M-Tree, the following kinds of structures are possible: Only pointers to objects are stored in the M-Tree and not the objects themselves, so every distance computation requires an additional access to the objects stored in a separate data file. Only the features needed for distance computation are stored in the M-Tree, whereas the actual objects are stored separately. Complete data objects are stored in the M-Tree, which then acts as the primary data organization. Depending on the object size, the size of the index features and the complexity of the distance computation, one of these methods is selected. Properties of the M-Tree are: Ground-node entries contain only the information of ground objects. Except for the root node, routing nodes contain a specific number of object entries for which level l is the highest, and one entry for the routing object at the parent level. The M-Tree grows in a bottom-up fashion. The objects in the root of a subtree are univocally associated with the parent routing object. The Covering Region of a routing object is the region containing all objects whose distance from the routing object is less than or equal to its covering radius. The covering tree of an object is always a subset of its covering region; there can be an object which is in the covering region but not stored in the subtree. Every new object is inserted as a ground object and may later get promoted to a routing object. Degradation of an object from level l to a level l′ (l > l′) implies that the object loses all its properties at levels l′+1 to l.

4.4.2.3

Operations

Search: Given a query object Q and the search radius r, the search algorithm follows all paths that cannot be excluded from leading to relevant objects. A queue of Pending Requests (PR) containing pointers to nodes still waiting to be examined is maintained. Initially the PR queue has the entry pointer(N), where N is the root node. Search(N,Q): Input: M-Tree rooted at N and the query region Q. Output: All objects in the query region.


Step 1: The entry of N is removed from the PR queue. Step 2: For each object O in the set of entries in N, including the l-level information of N's parent object, Steps 3 and 4 are performed. Step 3: If l is the highest level of O and d(O,Q) ≤ r(Q), O is added to the result set. Step 4: If N is not a leaf and d(O,Q) ≤ r(Q) + r(O), the pointer to the subtree of O is added to the PR queue. The last step uses the triangle inequality to prune the search space: if d(O,Q) > r(Q) + r(O), then since for any object E in the tree rooted at O we have d(E,O) ≤ r(O), and d(E,Q) ≥ d(O,Q) − d(E,O) follows from the triangle inequality, it is true that d(E,Q) > r(Q). This guarantees that the subtree rooted at O does not contain relevant objects. Insertion: The algorithm for inserting a new object O first descends the M-Tree to locate the most suitable leaf node to hold O. If insertion into the chosen node causes an overflow, the node is split and the split is propagated up to the higher levels of the tree. If the node N to be split is a leaf node, O is inserted into N and the overflow-management procedure is invoked. A node N at level k has a set of entries which includes the k-level information of its parent node P, i.e. Pointer(T(P)) and r(P), where T(P) is the subtree rooted at P. This algorithm resembles that of the R-Tree, and hence multiple paths may have to be traversed during exact-match queries. New objects are added to the leaf nodes; in case of overflow, the parent object of the node is degraded and the inconsistency is resolved using promotion by partitioning and promotion by voting. While inserting a new object, the following alternatives can be used to select the path to be traversed: Minimum increase in radius: an entry E is selected from the current node N at level k which minimizes d(E,O) − r(E), subject to d(E,O) − r(E) > 0. Minimum overlap with other regions: an entry E in N is selected for which the new value of the covering radius rnew(E) minimizes Σ H(r(Oi) + rnew(E) − d(Oi,E)), where the sum ranges over Oi ∈ N, Oi ≠ E, and H() is a ramp function: H(x) = x if x ≥ 0, else H(x) = 0. The term H(r(Oi) + rnew(E) − d(Oi,E)) is a measure of the overlap of the closures of the covering regions, so it makes sense to choose the entry E which leads to the minimum global overlap. Insert(N,O): Input: M-Tree rooted at N and the object to be inserted O. Output: Reorganized M-Tree with O inserted.


Step 1: If N is a leaf node, O is inserted into N and, if required, overflow management is performed. Step 2: If N is a non-leaf node, the following steps are followed to choose the subtree to be traversed. Step 3: Let N′ be the set of entries in N such that N′ = {E ∈ N | d(E,O) ≤ r(E)} and let N″ = N − N′. Step 4: If the set N′ is not empty, a tie-breaking function is used to select an entry E from N′. Step 5: If N′ is empty, an entry E is selected from N″ using a different tie-breaking function. Step 6: Let N = Pointer(T(E)) and invoke Insert(N,O). Node Splitting: Given a set of entries, splitting requires specific algorithms for: choosing two new routing objects for promotion, and assigning the objects to the two new sets. A good split strategy has the following requirements: For efficient pruning, the covering radii should be low. There should be minimum overlap, to avoid multiple paths being traversed. After splitting, the two new sets should have the same number of entries, or at least a minimum size should be guaranteed for each new set. When distance computations are costly, their number can be minimized by: storing pre-computed distance values in the nodes, with a trade-off between the fan-out and the number of pre-computed distances; avoiding the computation of pairwise distances when a node is split, at the price of reduced efficiency during retrieval; and using approximate distance functions which are easier to compute. Promotion by Partitioning: Two objects are selected for promotion and the other objects are assigned to two disjoint subsets depending on their relation to the selected objects. Split(N): Input: The node to be split N, at level l. Output: Nodes N and N′. Step 1: The parent object P of N is degraded to level l.


Step 2: A new node N′ is created. Step 3: Two new routing objects O′ and O″ are selected from the set of entries in N, including P, and stored in N and N′ respectively. Step 4: The remaining objects are distributed between the two nodes N and N′ depending on their relationship to the objects O′ and O″. Step 5: Objects O′ and O″ are promoted to level l+1 and stored in the common parent node of N and N′. Deletion: The deletion process performs an exact-match query in the M-Tree and reaches the leaf node containing the object to be deleted, whose entry is then removed. An l-level object may have to be degraded and moved to a lower level during deletion. Promotion by Voting: In this method, a single object is selected from the set for promotion. During degradation of an object from level l to l′, if l > l′+1 then promotion by partitioning results in orphan nodes between levels l and l′; to prevent this inconsistency an entry is promoted from each orphan node to the next higher level. The M-Tree was designed for similarity-based retrieval on dynamic sets of objects for which only a metric function on the relevant features is defined. Once an object is promoted, it is associated with a covering radius which defines a sphere centered on the object within which all the objects of the corresponding routing object's subtree are located. The M-Tree requires functions to compute the center, which can be complex and time-consuming, and introducing additional features can increase the storage overhead when complex features are used. When more information is available, other specific index structures such as R-Trees and SS-Trees can be used.
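The triangle-inequality pruning of the search algorithm (Step 4 above, in the Search procedure) can be sketched as follows; the node layout (dicts with 'routing'/'ground' keys) and the 1-D metric are illustrative assumptions, not the paper's representation:

```python
def range_search(node, dist, q, r_q, result):
    """Range search with triangle-inequality pruning.

    node: dict with 'routing' = [(obj, r_cov, child), ...] for internal
          nodes, or 'ground' = [obj, ...] for leaf nodes.
    """
    if "ground" in node:
        result.extend(o for o in node["ground"] if dist(o, q) <= r_q)
        return result
    for obj, r_cov, child in node["routing"]:
        # If d(O,Q) > r(Q) + r(O), no object in O's subtree can qualify,
        # so the whole subtree is skipped without further distance work.
        if dist(obj, q) <= r_q + r_cov:
            range_search(child, dist, q, r_q, result)
    return result

# 1-D example with the absolute-difference metric: the second routing
# object (10.5, radius 0.5) is pruned for a query around 1.8.
leaf1 = {"ground": [1.0, 2.0]}
leaf2 = {"ground": [10.0, 11.0]}
root = {"routing": [(1.5, 0.5, leaf1), (10.5, 0.5, leaf2)]}
hits = range_search(root, lambda a, b: abs(a - b), q=1.8, r_q=0.5, result=[])
```

Because the pruning test only ever discards subtrees that provably contain no qualifying object, the answer is exact, with no false drops.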

4.5 Signature Methods


4.5.1 VA-File
Similarity search is an important aspect of multimedia information retrieval and data mining. Similarity is measured on the features of an object, collected in a Feature Vector. Space-partitioning methods like the Grid File [Sevc94] and K-D-B-Tree [Robi81] and data-partitioning methods like the R-Tree [Gutt84], R*-Tree [Krei90] and X-Tree [Krei96] have been proposed, but these methods work well only for low-dimensional spaces; their performance degrades as the number of dimensions increases, a phenomenon termed the dimensionality curse. The Vector Approximation File or VA-File [Blot98] is a filter-based method that overcomes the difficulties of high dimensionality. It is based on the idea of object approximation, where compression and quantization are used to reduce the amount of data without losing too much information. It assumes that data and query points are uniformly


distributed within the data space, and that the dimensions are independent. It reduces the amount of data that must be read during similarity searches. The VA-File is an array of compact geometric approximations of the objects; the approximations are scanned, and each approximation determines a lower and an upper bound on the distance between its data point and the query. These bounds filter most of the vectors out of the search.

4.5.1.1

VA-File Vs Other Partitioning Methods

The performance of data-partitioning methods degrades exponentially as the dimensionality increases. The growing overlap in data-partitioning methods like the R*-Tree also hurts search performance as the dimensionality increases; in some cases they even read more blocks than a sequential scan in higher dimensions. The VA-File retains good performance as the dimensionality increases, and even improves with it. Unlike the other methods, the VA-File is a simple flat structure: the data need not be partitioned or clustered, and distribution, parallelization, concurrency control and recovery are simplified.

4.5.1.2

Salient Features

The VA-File, a signature-based method, is an array of bit-vector approximations based on quantization. Signatures are more compact than vectors and provide an approximation of the information in the vectors. Each d-dimensional data point is represented by b bits: the data space is divided into 2^b rectangular cells, each cell has a bit representation of length b, and each data point is approximated by the bit representation of the corresponding cell. The approximation can be used to derive bounds on the distance between a query point and a vector: the lower bound is the shortest distance from the query to a point in the cell, and the upper bound is the longest distance to a point in that cell.

Figure 17a illustrates the structure of the VA-File for the running example. The number of dimensions d is 2, and two bits are assigned to each dimension, so the data space is partitioned into 2^4 = 16 cells, each represented by a bit string of length 4. Each data object is approximated by the bit string of the cell into which it falls. For example, the object G can be approximated by the bit string 0110 and H by the string 1101. These bit strings are stored in the VA-File.
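The per-dimension quantization just described can be sketched in code. This is a minimal illustration, not the implementation from [Blot98]: it assumes the data space is the unit hypercube, that dimension i is partitioned uniformly into 2^bits[i] slices, and that the per-dimension cell numbers are simply concatenated; the function name is hypothetical.

```python
def approximate(vector, bits):
    """Return the concatenated bit string approximating `vector`."""
    code = ""
    for x, b in zip(vector, bits):
        # Cell number along this dimension, clamped so x = 1.0 stays in range.
        cell = min(int(x * (1 << b)), (1 << b) - 1)
        code += format(cell, f"0{b}b")  # b-bit binary representation
    return code

# Running example: d = 2, two bits per dimension gives 2**4 = 16 cells.
print(approximate([0.3, 0.9], [2, 2]))  # prints 0111
```

Real VA-File implementations may assign a different number of bits to each dimension and derive the slice boundaries from the data distribution rather than assuming a uniform grid.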


Figure 17a: VA-File

Figure 17b: VA-File Values

4.5.1.3 Operations

In the VA-File it is possible to precompute the distance from the query point to each of the partition points in all dimensions, which reduces the CPU cost of its search algorithms.

Search: Simple Search Algorithm (VA-SSA): In this procedure, two arrays are maintained for storing the k nearest neighbors, sorted in order of increasing distance. The approximations are scanned linearly. Initially, while there are fewer than k vectors in the list, each new vector is added. Thereafter, whenever the lower bound of a new vector is less than the distance of the k-th vector in the list, it becomes a candidate vector in the list. Only for the vectors in the list are the actual distances computed, and only these points are visited. The advantage of this filter-based approach is that it is simple, has low memory overhead and accesses the vectors sequentially. Its performance depends on the ordering of the vectors and approximations.

Procedure VA-SSA(Q):
Input: The query vector Q.
Output: An array A of the k nearest neighbors.
Step 1: The distance array D is initialized with a maximum value, say MAX.
Step 2: For each approximation A, steps 3 to 5 are performed.
Step 3: The lower bound, L, of A with Q is computed.
Step 4: If L is less than MAX and MAX is less than the last element in the D array, then the result array A and the distance array D are updated and both are sorted by distance.
Step 5: MAX is set to the last element in the D array.

Near-Optimal Search Algorithm (VA-NOA): This algorithm decreases the I/O cost by minimizing the number of vectors visited. It has two phases. In the first phase the approximations are scanned to gather information: the lower and upper bounds for each vector are determined and filtering is done. During filtering, if the lower bound of the current vector is larger than the k-th largest upper bound encountered so far, the vector can be discarded. In this way most of the vectors are pruned, leaving a set of candidates. In the second phase, the selected vectors are sorted in increasing order of their lower bounds; each one is visited in turn, and the scan stops once the k-th lower bound is encountered. A heap-based priority queue supports this procedure.

Procedure VA-NOA(Q):
Input: The query vector Q.
Output: An array A of the k nearest neighbors.
Phase 1:
Step 1: The distance array D is initialized with a maximum value, say MAX.
Step 2: For each approximation A, steps 3 to 6 are performed.
Step 3: The lower and upper bounds, L and U, of A with Q are computed.


Step 4: If L is less than or equal to MAX and MAX is less than the last element in the D array, then the result array A and the distance array D are updated and both are sorted by distance.
Step 5: MAX is set to the last element in the D array.
Step 6: L, together with the number of the vector, is inserted into the heap.
Phase 2:
Step 1: The distance array D is initialized with a maximum value, say MAX.
Step 2: The lower and upper bounds, L and U, are taken from the top of the heap.
Step 3: While L is less than MAX, steps 4 and 5 are followed.
Step 4: If MAX is less than the last element in the D array, then the result array A and the distance array D are updated and both are sorted by distance.
Step 5: The lower and upper bounds, L and U, are taken from the top of the heap.

Update: General update operations, including insertion and deletion, can be done easily, but performance may degrade if too much fragmentation is allowed to occur. As this method is suited to static databases, update operations that do not affect the data distribution can be performed safely.

The VA-File is an approximation-based index method for high-dimensional vector spaces. It overcomes the dimensionality curse, which degrades the performance of the existing data-partitioning index structures, and it has a simple flat structure. Its partitioning of the data space is similar to that of Grid-Files and other hashing methods. The number of bits selected and their allocation to the dimensions are important design decisions. Unlike other structures, the performance of the VA-File improves slightly as dimensionality increases. The method is suitable for static databases, or for those whose data distribution is relatively static, and can be preferred for similarity search over large, uniformly distributed data sets.
Combining the concepts of the VA-File with tree-based index structures could reduce the amount of data the search algorithm must examine, and a good partition could be defined for parallelization and distribution.
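The filtering idea behind VA-SSA can be sketched for the 1-nearest-neighbor case, under the same illustrative assumptions as before (unit data space, uniform per-dimension partitioning, Euclidean distance, hypothetical function names): a vector is fetched and its exact distance computed only when its cell's lower bound could still beat the best distance found so far.

```python
import math

def cell_bounds(code, bits):
    """Lower and upper corners of the cell encoded by the bit string `code`."""
    lo, hi, pos = [], [], 0
    for b in bits:
        cell = int(code[pos:pos + b], 2)
        lo.append(cell / (1 << b))
        hi.append((cell + 1) / (1 << b))
        pos += b
    return lo, hi

def lower_bound(q, code, bits):
    """Shortest distance from query q to any point of the cell."""
    lo, hi = cell_bounds(code, bits)
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def nn_search(q, vectors, approximations, bits):
    """VA-SSA-style scan: visit a vector only if its cell's lower bound
    can still beat the best exact distance found so far."""
    best, best_i = float("inf"), None
    for i, code in enumerate(approximations):
        if lower_bound(q, code, bits) < best:   # filter step on the approximation
            d = math.dist(q, vectors[i])        # visit the actual vector
            if d < best:
                best, best_i = d, i
    return best_i, best
```

VA-NOA refines this by also collecting upper bounds in a first pass and then visiting the surviving candidates in ascending lower-bound order, as described above.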

4.5.2 A-Tree
Content-based information retrieval uses feature vectors extracted from images. Since retrieving high-dimensional feature vectors incurs a high cost for large data sets, efficient spatial index structures and search methods are required. Both data-partitioning methods using minimum bounding blocks, such as the R*-Tree [Krei90], X-Tree [Krei96] and SR-Tree [Kata97], and approximation-based schemes, such as the VA-File [Blot98], an array of geometric approximations of vectors, have been proposed. The A-Tree [Saku00] is based on the concepts of the SR-Tree and the VA-File: it applies the notion of relative approximation to the hierarchical structure of the SR-Tree. The basic idea is to use Virtual Bounding Rectangles (VBRs), which contain MBRs and approximate data objects; the precision of the VBRs affects performance. Since each node can store a large number of VBRs, the fan-out increases, resulting in faster searches. The nodes contain entries of an MBR and its children's VBRs, so by fetching a single A-Tree node the exact position of a parent MBR and the approximate positions of its children can be obtained.

4.5.2.1 A-Tree vs. SR-Tree and VA-File

The SR-Tree performs better than the VA-File for non-uniformly distributed data sets, where it has higher search performance. In the VA-File the approximation is based on absolute positions, which can lead to large approximation errors for skewed data, so the VA-File is not effective for non-uniformly distributed data. In the SR-Tree the size of the entries in a node is proportional to the dimensionality, resulting in smaller fan-outs that hurt search performance, and as the dimensionality increases the contribution of minimum bounding spheres to node pruning becomes small. The A-Tree instead uses relative approximation, where bounding regions are approximated by their positions relative to the parent's bounding region. Unlike in the VA-File, the approximation values of relative approximation in the A-Tree change with the data distribution, so compared to the VA-File the A-Tree provides high accuracy for its VBRs. Because the approximation values can be represented compactly, the fan-out is larger. As the effect of bounding spheres is limited in high-dimensional spaces, they are not stored in the A-Tree; only the centroids are kept, for use during updates. The CPU time of the VA-File is higher than that of the A-Tree: even though the A-Tree has to calculate and compare distances, it benefits from few node accesses and the filtering applied to its queues. An A-Tree with full utilization is suitable for static data sets, as its insertion cost is larger than that of the SR-Tree. At lower dimensions the A-Tree and SR-Tree have similar storage costs, but at higher dimensions the A-Tree has lower costs, even though it must store VBRs, since VBRs need little storage and the fan-out is large.


Figure 18a: A-Tree

Figure 18b: A-Tree Structure

The A-Tree includes MBRs, data objects and the subspace codes of the VBRs. The A-Tree structure for the running example is shown in Figure 18a. The box represents the entire two-dimensional data space, consisting of both spatial and point objects. M7, containing the MBRs M1 to M3, and M8, containing M4 to M6, partition the space. V7 and V8 are the VBRs of the MBRs M7 and M8 at the root. V1, V2 and V3 are the VBRs of M1, M2 and M3, respectively, in MBR M7, and V4, V5 and V6 are the VBRs of M4, M5 and M6, respectively.


4.5.2.2 Salient Features

A Virtual Bounding Rectangle, or VBR, is a rectangle that contains and approximates an MBR or a data object. The children MBRs and data objects are approximated as VBRs by their positions relative to their parent MBR: the basic idea of relative approximation is to quantize the interval of a child's bounding region relative to the interval of its parent's. The binary codeword used to represent a VBR is called its subspace code, and a trade-off can be made between the length of the subspace code and the approximation error. The A-Tree uses a concept of full utilization, which fully uses all disk pages and thereby reduces approximation error. The important features of the A-Tree are the following. The A-Tree is a tree-structured index in which the representation of MBRs and data objects is based on approximation relative to their parent MBRs. It is a dynamic structure in the sense that the approximation values change with the distribution of the data set. The compact representation of the approximation values results in a larger fan-out and fewer node accesses. Partial information about the bounding regions of two levels can be obtained from a single node, which is useful during update operations. The centroids of data objects are used only for update operations.
Index nodes other than the root contain one MBR and the subspace codes of the VBRs of the MBRs in the children nodes. Data nodes have entries of the form (V, Pointer), where V is the spatial vector of an object and Pointer points to the object. Leaf nodes contain the MBR of the entries in the data node, a pointer to the data node, and the subspace codes for the entries in the data node. An intermediate node contains the MBR of its children's MBRs and a list of entries of the form (Pointer, SC(V), w, C), where Pointer is the pointer to the child node, SC(V) is the subspace code of the VBR V of the child node, w is the total number of entries in the subtree rooted at that node, and C is the centroid of the data objects in the subtree. The root node has no MBR, since it covers the entire data space; its subspace codes are calculated from the entire data space and the children MBRs, and it has entries of the form (Pointer, SC(V), w, C).
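The quantization behind the subspace codes can be sketched for a single dimension. This is an illustrative reading of relative approximation, not the paper's exact encoding: the child interval is quantized relative to its parent MBR's interval using b-bit codes, and the decoded rectangle is rounded outward so that the resulting VBR still contains the child; the function names are hypothetical.

```python
def encode(parent, child, b):
    """Quantize child = (lo, hi) relative to parent = (lo, hi) into b-bit codes."""
    p_lo, p_hi = parent
    scale = (1 << b) / (p_hi - p_lo)
    c_lo = int((child[0] - p_lo) * scale)                     # round down
    c_hi = min(int((child[1] - p_lo) * scale), (1 << b) - 1)  # clamp to last slice
    return c_lo, c_hi

def decode(parent, code, b):
    """Recover the VBR: the smallest grid-aligned interval containing the child."""
    p_lo, p_hi = parent
    step = (p_hi - p_lo) / (1 << b)
    c_lo, c_hi = code
    return p_lo + c_lo * step, p_lo + (c_hi + 1) * step
```

Because the codes are interpreted relative to the parent, the same number of bits yields finer cells wherever the parent MBR is small, which is why relative approximation adapts to skewed data distributions.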

4.5.2.3 Operations

Search: The general k-nearest-neighbor search algorithm calculates the minimum distances between the MBRs in a node and the query point and keeps them in a priority queue in sorted order. Nodes are visited from the top of the queue until the queue becomes empty. A list of the k nearest neighbors found so far is also maintained, and the priority queue is pruned by eliminating nodes whose distance to the query is greater than that of the k-th object in the list. The A-Tree keeps these lists in the form of a Nearest-Neighbor Object List (NNOL) and a Nearest-Neighbor VBR List (NNVL).

Search(Q, k):
Input: query point Q and the number of objects to be retrieved, k.
Output: the k nearest-neighbor objects of Q.
Step 1: The priority queue is initialized with the entry of the root node. The lists NNOL and NNVL are initialized with their distances set to infinity.
Step 2: An entry is extracted from the top of the priority queue; its pointer is traversed to fetch a node N.
Step 3: If the extracted node N is a data node, then for each entry E in N whose distance to Q is less than or equal to that of the k-th nearest-neighbor object found so far, steps 3.1 to 3.3 are performed.
Step 3.1: Entry E, together with its distance, is stored in NNOL as a nearest-neighbor candidate.
Step 3.2: NNOL is sorted by distance.
Step 3.3: The priority queue is filtered using NNOL.
Step 4: If the extracted node N is an index node, for each entry E in N steps 5 and 6 are followed.
Step 5: The position of the VBR is calculated from the MBR of N and the subspace code.
Step 6: If the distance between Q and the VBR is less than or equal to the k-th distance in NNOL and NNVL, steps 6.1 to 6.3 are performed.
Step 6.1: Entry E and its distance are inserted into the priority queue.
Step 6.2: The queue is sorted in ascending order of distance.
Step 6.3: If N is a leaf node and the distance is less than or equal to the k-th distance in NNVL, steps 6.3.1 to 6.3.3 are followed.
Step 6.3.1: The k-th entry in NNVL is updated with the maximum distance from the query point to the VBR.
Step 6.3.2: NNVL is sorted by its distance values.


Step 6.3.3: The queue size is reduced by eliminating pairs in the queue whose distance to Q is greater than the k-th entry in NNVL.

Update: The update algorithm of the A-Tree is similar to that of the SR-Tree. Starting from the data object insertion or deletion, the centroids in the non-leaf nodes are adjusted; in addition, the codes of the VBRs must be recalculated and updated along the path.

Procedure 1 - Update(N):
Input: the A-Tree after an entry has been inserted into or deleted from data node N.
Output: the reorganized A-Tree.
Step 1: The MBR entry in N's parent and the centroid of all data objects in N are adjusted.
Step 2: If the MBR of N remains unchanged, the code of the VBR approximating the newly inserted object with respect to the MBR of N is calculated and updated in N's parent. If there is any change in the MBR of N, the codes of all the VBRs stored in N's parent are updated. Then N is set to its parent node, and steps 3 and 4 are followed for all the index nodes.
Step 3: If an insertion or deletion occurred in the subtree rooted at N, the centroid of all objects in the subtree, which is stored in N's parent node, is updated. If there is a change in the MBR of a child node, the MBR of N is adjusted.
Step 4: If the MBR of N remains unchanged, the code of the VBR that approximates the updated child MBR is updated. If there is any change in the MBR of N, the codes of all VBRs in N's parent are updated.

When the technique of full utilization is used, the code length for approximating the MBRs or data objects in each node varies. Therefore, if the MBR or the number of entries in a node changes, the codes of all the entries must be recalculated. The update procedure under full utilization differs from the one above in its second and final steps.

Procedure 2 - Update(N):
Input: the A-Tree after an entry has been inserted into or deleted from data node N.
Output: the reorganized A-Tree.
Step 1: The MBR entry in N's parent and the centroid of all data objects in N are adjusted.
Step 2: If the MBR and the number of entries of N remain unchanged, the code of the VBR approximating the newly inserted object with respect to the MBR of N is calculated and updated in N's parent. If there is any change in the MBR of N or in its number of entries, the code length assigned to each dimension for approximating all data objects is recalculated from the MBR of N and its number of entries, and the codes of all VBRs stored in N's parent are updated.
Step 3: If an insertion or deletion occurred in the subtree rooted at N, the centroid of all objects in the subtree, which is stored in N's parent node, is updated. If there is a change in the MBR of a child node, the MBR of N is adjusted.
Step 4: If the MBR and the number of entries of N remain unchanged, the code of the VBR that approximates the updated child MBR is updated. If there is any change in the MBR of N or in its total number of entries, the code lengths assigned to each dimension for approximating all children MBRs are recalculated and the codes of all VBRs in N's parent are updated.

The A-Tree is an approximation-based index structure developed for similarity search in high-dimensional data after an analysis of the SR-Tree and the VA-File. The basic idea is to use Virtual Bounding Rectangles, which represent data objects compactly. Since tree nodes can hold a large number of VBR entries, the fan-out is large. Nodes contain entries of an MBR and its children's VBRs. Due to its use of relative approximation, the A-Tree performs well and is efficient even for non-uniformly distributed data sets. The technique of full utilization provides higher search performance, since the approximation errors of the VBRs are reduced. The storage cost is low and search performance is high.
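The best-first k-nearest-neighbor traversal described in 4.5.2.3 can be sketched as follows. The node layout and function names are hypothetical, and plain MBRs are used; a real A-Tree would first decode each child's rectangle from its subspace code relative to the parent MBR, and would maintain the NNVL pruning list in addition to the candidate list shown here.

```python
import heapq
import math

def mindist(q, rect):
    """MINDIST between point q and rectangle rect = (lo_corner, hi_corner)."""
    lo, hi = rect
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def knn(root, q, k):
    """Nodes are ('leaf', [points]) or ('node', [(rect, child), ...])."""
    queue = [(0.0, 0, root)]          # (priority, tie-break counter, node)
    result, counter = [], 1           # result: sorted list of (dist, point)
    while queue:
        d, _, node = heapq.heappop(queue)
        if len(result) == k and d > result[-1][0]:
            break                     # no remaining node can hold a closer point
        kind, entries = node
        if kind == "leaf":
            for p in entries:
                result.append((math.dist(q, p), p))
            result.sort()
            del result[k:]            # keep only the k best candidates
        else:
            for rect, child in entries:
                heapq.heappush(queue, (mindist(q, rect), counter, child))
                counter += 1
    return result
```

The early exit is what the NNOL-based pruning achieves: once the head of the queue is farther away than the current k-th candidate, no unvisited subtree can improve the result.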

5. Comparative Study
Various factors affect the performance of index structures: the type of operating system, buffer size, data distribution, volume, density and degree of clustering of the data. Performance is measured in terms of the number of disk accesses and the time taken for operations such as search and deletion. In this section we briefly summarize various theoretical and experimental performance comparisons of index structures that have been conducted in the literature.

[Gree89] compares the search performance of the R-Tree, R+-Tree and K-D-B-Tree on 10,000 uniformly distributed data items, with page and query sizes as parameters. The results showed that the R-Tree and its variant outperformed the K-D-B-Tree in all aspects. When the overlap between data rectangles is minimal, the R+-Tree performs better than the other R-Tree variants.

[Schn90] studied the performance of the hB-Tree, Buddy-Tree, BANG file [Free87] and R-Tree under different distributions of the data objects. The Buddy-Tree and BANG file performed better than the hB-Tree for clustered data and query ranges of 10% of the data space. When the query range dropped to about 0.1%, the Buddy-Tree was faster than the BANG file. For all data distributions, the BANG file and Buddy-Tree performed better than the R-Tree when the number of page accesses was used as the benchmark.

[Schi90] compares the performance of the R*-Tree with the other variants of the R-Tree for different data distributions, based on the number of page accesses. The R*-Tree had the best


storage utilization and outperformed all the R-Tree variants. [Falo94] showed that the Hilbert R-Tree gives better search results than the R*-Tree, and that when Hilbert curves are used for bulk insertions the R*-Tree can also give better performance.

[Hoel92] gives a qualitative performance comparison of three structures, the R*-Tree, R+-Tree and PMR-quadtree [Nels87], on databases of 50,000 line segments. Due to its non-overlapping property, the R+-Tree performed better than the R*-Tree, and the PMR-quadtree performed marginally better than the other two structures. Because of the use of non-zero-sized objects, this study did not identify a superior structure.

[Papa95] analyzed the topological relations of overlap, inside, contains, disjoint, covered-by, covers and meet between MBRs. Three databases of 10,000 objects with varying MBR sizes were used to study the performance of the R-Tree, R+-Tree and R*-Tree. The R+-Tree performed better than the R*-Tree, and both outperformed the R-Tree, for smaller MBRs. However, due to the duplication of objects, the R+-Tree performed worse than the other two structures for larger MBRs. The experimental results of [Papa95] and [Gree89] suggest that the R+-Tree does not cope well with high data density.

[Kata97] uses 16-dimensional synthetic and real data sets to compare the K-D-B-Tree, VAMSplit R-Tree [Whit96], R*-Tree, SS-Tree and SR-Tree, with CPU time as the performance measure. The SR-Tree and SS-Tree performed better than the K-D-B-Tree and R*-Tree, with the SR-Tree the best among them. Since the VAMSplit R-Tree is an optimized structure built with prior knowledge of the data set, it outperforms the others; the SR-Tree, a dynamic index structure, nevertheless shows performance comparable to that of the static VAMSplit R-Tree.

[Blot98] gives a performance comparison of the VA-File with the X-Tree and R*-Tree. The measurements were based on the number of blocks visited for nearest-neighbor queries of size 10, varying the number of dimensions. For a uniformly distributed synthetic data set and a real data set of 50,000 image objects, the performance of the X-Tree and R*-Tree degenerated to that of a linear scan. The VA-File improved with dimensionality and outperformed the tree structures in block reads and search times, especially at higher dimensions. The X-Tree performed better than the R*-Tree at lower dimensions.

[Wang98] compared the performance of the PK-Tree, SR-Tree and X-Tree. With uniformly distributed, real and clustered data sets, the CPU time of the PK-Tree was smaller than that of the X-Tree due to the presence of supernodes, while the number of I/Os was similar for the two. The SR-Tree performed poorly on both measures due to its large size. As the number of dimensions increases, the overlap between partitions increases in the X-Tree and SR-Tree; since there is no overlap among sibling nodes in a PK-Tree, at higher dimensions it outperforms the X-Tree and SR-Tree by a large margin.

[Saku00] uses three types of data sets to compare the VA-File, SR-Tree and A-Tree. The A-Tree performs almost equally to the VA-File for uniform data sets and outperforms the other structures for non-uniformly distributed data. One of the main disadvantages of the SR-Tree is its high search cost due to the storage of both minimum bounding spheres and rectangles. Since the A-Tree is based on relative approximation and the storage cost of virtual bounding rectangles is small, it achieves better performance. The insertion time of the A-Tree is greater than that of the SR-Tree, since the A-Tree has to update VBRs in addition to MBRs. The VA-File takes more CPU time than the other structures, and as dimensionality increases the performance of the A-Tree approaches that of the SR-Tree.

Appendix A, Table 1 ([Bohn01], [Brow98]) shows some important properties of the structures discussed in this paper.

6. Future Work and Conclusion


This paper has provided a survey of multidimensional access methods and classified the structures by the concepts underlying them. In spite of the wealth of spatial access methods, no single access method can be identified that outperforms all the others under all conditions.

For future work, it would be useful to perform a detailed study of all of the structures using the same data sets under different distributions, and to analyze the storage utilization of each structure on those data sets. Such experiments should determine which applications are best suited to a given access method. Another avenue is to develop new structures that combine the concepts of two or more existing methods, weighing the pros and cons of each. Existing index structures can also be optimized; one approach is to develop packing/bulk-loading algorithms that improve their utilization and search performance. Several generic bulk-loading algorithms were proposed in [Berc97] and [Krie99]. [Kame93] proposed bulk loading of R-Trees using the Hilbert curve: the data objects are linearly ordered by their Hilbert values and the tree is then created bottom-up. Fast and efficient algorithms for nearest-neighbor search in index structures for high-dimensional spaces were proposed in [Berc98], [Falo96], [Lipk98], [Papa97] and [Rous95]. Various architectures for parallelizing existing index structures such as R-Trees were proposed in [Kame92], [Leut99], [Xiao97] and [Mute00].
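The bottom-up bulk-loading idea of [Kame93] can be sketched as follows. A Morton (Z-order) key is used here as a simpler stand-in for the Hilbert value, and the function names and fan-out are illustrative.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of the two quantized coordinates (Z-order)."""
    xi, yi = int(x * ((1 << bits) - 1)), int(y * ((1 << bits) - 1))
    key = 0
    for i in range(bits):
        key |= ((xi >> i) & 1) << (2 * i) | ((yi >> i) & 1) << (2 * i + 1)
    return key

def bulk_load(points, fanout=4):
    """Pack curve-ordered points into leaves, then build parents bottom-up."""
    ordered = sorted(points, key=lambda p: morton_key(*p))
    level = [ordered[i:i + fanout] for i in range(0, len(ordered), fanout)]
    while len(level) > 1:             # group nodes upward until one root remains
        level = [level[i:i + fanout] for i in range(0, len(level), fanout)]
    return level[0]
```

Because neighbors along the curve tend to be neighbors in space, each packed leaf covers a compact region, which is what gives curve-based bulk loading its good storage utilization and query performance.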


References
[Ahn01] Hee-Kap Ahn, Nikos Mamoulis, and Ho Min Wong: A Survey on Multidimensional Access Methods, Technical Report, UU-CS-2001-14, May 2001, Institute of Information and Computing Sciences, Utrecht University. Bayer, R. and McCreight, E.:Organization and Maintenance of large ordered Indexes. Acta Informatica 1,3 (1972), 173-189. R. Bayer: "The universal B-Tree for multidimensional Indexing: General Concepts" WWCA '97. Tsukuba, Japan, LNCS, Springer Verlag, March, 1997 R. Finkel and J.L. Bentley: Quad trees: A data structure for retrieval of composite keys, Acta Informatica, 4(1):1--9, 1974. J.L.Bentley: Multidimensional Binary Search Trees Used for Associative Searching. Communicatio n of the ACM, 18(9):509-517, 1975. J.L.Bentley: Multidimensional Binary Search Trees in Database Applications. IEEE Trans. on Soft. Eng., Vol. SE-5, No. 4, July 1979. J. van den Bercken, B. Seeger, P. Widmayer: A generic approach to bulk loading multidimensional index structures, Proceedings of the 1997 VLDB, Athen. S. Berchtold, B. Ertl, D. A. Keim, H.-P. Kriegel, and T. Seidl: Fast nearest neighbor search in high-dimensional space, In Proceedings of the Fourteenth International Conference on Data Engineering (ICDE), 1998. Bially, T: Space Filling Curves: Their Generation and Their Application to Bandwidth Reduction, IEEE Trans. Information Theory, Vol. IT-15, No.6 November 1969, pp658-664. R.Weber, H-J. Schek, S.Blott: "A Quantitative Analysis and Performance Study for Similarity Search Methods in High-Dimensional Spaces" In. Proc. Of the 24th International Conference on Very Large Databases, pp. 194-205, New York City, NY, August 1998. Christian Bohm, Stefan Berchtold, Daniel Keim: Searching in HighDimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases, ACM Computing Surveys, 2001. Tolga Bozkaya and Meral Ozoyoglu: Distance-Based Indexing for HighDimensional Metric Spaces, In Proceedings of ACM SIGMOD, 1997.

[Baye72]

[Baye97]

[Bent74]

[Bent75]

[Bent79]

[Berc97]

[Berc98]

[Bial69]

[Blot98]

[Bohn01]

[Bozk97]

109

[Bozk99]

Bozkaya T., Ozsoyoglu M.: "Indexing large metric spaces for similarity search queries" ACM Trans. Database Systems. 24,3 (Sep. 1999) Pages 361-404. Brown, L. and L. Gruenwald: Tree-Based Indexes for Image Data, Journal of Visual Communication and Image Representation, Volume 9, Number 4, December 1998, pp. 300-313. Ciaccia, P., M. Patella, and P. Zezula: 1997: "M-tree: an efficient Access Method for Similarity Search in Metric Spaces" In: Proc. of the 23rd Conference on Very Large Databases (VLDB'97). pp. 426435 D. Comer: The Ubiquitous B-tree. Computing Surveys, 11(2):121--137, 1979. Faloutsos, C. and Roseman, S: Fractals for Secondary Key Retrieval, ACM Symposium on Principles of Database Systems, March 1989, pp247-252. Ibrahim Kamel, Christos Faloutsos: Hilbert R -tree: An Improved R-tree using Fractals, VLDB 1994: 500-509. Flip Korn, Nikolaos Sidiropoulos, Christos Faloutsos, Eliot Siegel, Zenon Protopapas: Fast Nearest Neighbor Search in Medical Image Databases, VLDB 1996: 215-226. Volker Gaede , Oliver Gnther: Multidimensional access methods, ACM Computing Surveys (CSUR), v.30 n.2, p.170-231, June 1997. Diane Greene : An implementation and performance analysis of spatial data access methods, In Proceedings of the ACM SIGMOD, 1989. O.Guenther and A.Buchmann: Research Issues on Spatial Databases. SIGMOD Record (ACM Special Interest Group on Management of data), 19(4): 61-68, December 1990. O. Gunther and J. Bilmes: Tree-based access methods for spatial databases: Implementation and performance evaluation, IEEE Trans. on Knowledge and Data Eng., pages 342--356, 1991. R. Guting: An introduction to spatial database systems, VLDB Journal, 3(4), 1994 A.Guttman: "R-Trees: A Dynamic Index Structure for Spatial Searching", Proc. ACM SIGMOD Conference, Boston, 1984, pp.47-57.

[Brow98]

[Ciac97]

[Come79]

[Falo89]

[Falo94]

[Falo96]

[Gaed97]

[Gree89]

[Gunt90]

[Gunt91]

[Guti94]

[Gutt84]

110

[Hinr85]

K. Hinrichs : Implementation of the grid file: design concepts and experience, BIT, 25:569--592, 1985. E.G. Hoel, H. Samet: A qualitative comparison study of data structures for large line segment data bases, Proc. ACM SIGMOD Int. Conf. on Management of Data, 205-214(1992). K. Lin, H. Jagadish, and C. Faloutsos: "The TV-Tree: An Index Structure for High- Dimensional Data", VLDB Journal, 3, October 1994. V. Jain and B. Shneiderman. Data Structures for Dynamic Queries: An Analytical and Experimental Evaluation, Technical Report CAR-TR-715 CS-TR-3287, Dept. of Comp. Sci., U. of Maryland, July 1995. D. White and R. Jain: "Similarity indexing with the SS-tree In Proc. 12th IEEE Int. Conf. on Data Engineering, pages 516--523, New Orleans, Louisiana, 1996. Kamel and C. Faloutsos: "Parallel R -trees", CS-TR-2820, University of Maryland, College Park, 1992. 17. Kamel and C. Faloutsos: " packing R On -trees", Proc. 2nd Int'l. Conf. on Information and Knowledge Management, pp. 490--499, 1993. N. Katayama and S. Satoh: "The SR-tree: An index structure for highdimensional nearest neighbor queries," Proceedings of ACM SIGMOD, May 1997. Curtis Philip Kolovson: Indexing techniques for multi-dimensional spatial data and historical data in database management systems, University of California at Berkeley, Berkeley, CA, 1991. N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger: "The R* -tree: an efficient and robust access method for points and rectangles", Proceedings of ACM SIGMOD Conference, 1990. S. Berchtold, D.A. Keim, and H.-P. Kriegel: "The X-tree: An index structure for high-dimensional data" In Proceedings of the International Conference on Very Large Databases (VLDB), pages 28--39, 1996. Christian Bhm, Hans-Peter Kriegel: Efficient Bulk Loading of Large High-Dimensional Indexes, DaWaK 1999: 251-260. Peter Kuba: Data Structures for Spatial Data Mining, FIMU Report Series, FIMU-RS-2001-05, September 2001.

[Hoel92]

[Jaga94]

[Jain95]

[Jain96]

[Kame92]

[Kame93]

[Kata97]

[Kolo90]

[Krei90]

[Krei96]

[Krie99]

[Kuba01]

111

[Kuma94]

Kumar: "G-Tree: A New Data Structure for Organizing Multidimensional Data," IEEE Transactions on Knowledge and Data Engineering, Vol. 6, No. 2, pp. 341-347, 1994. Master-client R -trees: a new parallel R -tree architecture, Schnitzer, B.; Leutenegger, S.T.; Scientific and Statistical Database Management, 1999. Eleventh International Conference on , Aug 1999 ,Page(s): 68 -77. Wang, S., Hellerstein, J.M., Lipkind, I: Near-neighbor query performance in search trees, Technical Report CSD-98-1012, UC Berkeley, 1998. David B. Lomet, Betty Salzberg: hB-Tree: A Robust Multi-Attribute Search Structure, ICDE 1989: 296-304. D. B. Lomet and B. Salzberg: "The hB-tree: A multi-attribute indexing method with good guaranteed performance" ACM Transactions on Database Systems, 15(4):625--658,December 1990 D. Lomet: A review of recent work on multi-attribute access methods, ACM SIGMOD Record, 21(3):56-63, 1992. Georgios Evangelidis, David B. Lomet, Betty Salzberg: The hB-Pi-Tree: A Multi-Attribute Index Supporting Concurrency, Recovery and Node Consolidation, VLDB Journal 6(1): 1-25(1997). W. Lu, H. Han: Distance-associated join indices dor spatial range search, Proc. 8th IEEE. International Conference on Data Engineering, 284-292, 1992. Parallel R -tree spatial join for a shared- nothing architecture, Mutenda, L. Kitsuregawa, M.; Database Applications in Non-Traditional Environments, 1999. (DANTE '99) Proceedings. 1999 International Symposium on , 2000, Page(s): 423 -43. R.C Nelson, H.Samet: A population analysis for hierarchical data structures, Proc. ACM Int. Conf on Management of Data, 270-277 (1987). Orenstein, J.: Multidimensional tries used for associative searching, Information Processing Letters 14(4), 150-157. D.Papadias, Y. Theodoridis, T. Sellis, M.J. Egenho fer: Topological relations in the world of minimum bounding rectangles: a study with R Trees, Proc. ACM SIGMOD Int. Conf. on Management of Data, 92-103 (1995).

[Leut99]

[Lipk98]

[Lome89]

[Lome90]

[Lome92]

[Lome97]

[Luha92]

[Mute00]

[Nels87]

[Oren82]

[Papa95]

112

[Papa97]

A. Papadopoulos, Y. Manolopoulos: Performance of Nearest Neighbor Queries in R-Trees, Proc. ICDT 1997, pp. 394-408.

[Proc97]

O. Procopiuc: Data Structures for Spatial Database Systems, 1997.

[Robi81]

J. T. Robinson: The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes, Proc. ACM SIGMOD Conf., 1981.

[Roge98]

R. Weber, H.-J. Schek, S. Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, Proc. VLDB 1998, pp. 194-205.

[Rote91]

D. Rotem: Spatial Join Indices, Proc. 7th IEEE International Conference on Data Engineering, pp. 500-509, 1991.

[Rous95]

N. Roussopoulos, S. Kelley, F. Vincent: Nearest Neighbor Queries, Proc. ACM SIGMOD Conf., 1995.

[Saga94]

H. Sagan: Space-Filling Curves, Springer-Verlag, 1994.

[Saku00]

Y. Sakurai, M. Yoshikawa, S. Uemura, H. Kojima: The A-Tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation, Proc. VLDB 2000, pp. 516-526.

[Same90]

H. Samet: The Design and Analysis of Spatial Data Structures, Addison-Wesley, Reading, MA, 1990.

[Same95]

H. Samet: Spatial Data Structures, Addison-Wesley/ACM, 1995.

[Schi90]

H.-P. Kriegel, M. Schiwietz, R. Schneider, B. Seeger: Performance Comparison of Point and Spatial Access Methods, Design and Implementation of Large Spatial Database Systems, LNCS 409, pp. 89-114, Springer-Verlag, Berlin/Heidelberg/New York, 1990.

[Seeg88]

B. Seeger, H.-P. Kriegel: Techniques for Design and Implementation of Efficient Spatial Access Methods, Proc. 14th International Conference on Very Large Data Bases, Brisbane, Australia, pp. 590-601, 1988.

[Seeg90]

B. Seeger, H.-P. Kriegel: The Buddy-Tree: An Efficient and Robust Access Method for Spatial Data Base Systems, Proc. 16th VLDB Conference, pp. 590-601, 1990.

[Seeg91]

Performance Comparison of Segment Access Methods Implemented on Top of the Buddy-Tree, Advances in Spatial Databases, LNCS 525, pp. 277-296, Springer-Verlag, Berlin/Heidelberg/New York, 1991.

[Sell87]

T. Sellis, N. Roussopoulos, C. Faloutsos: The R+-Tree: A Dynamic Index for Multidimensional Objects, Proc. VLDB, Brighton, England, pp. 3-11, 1987.

[Sevc94]

J. Nievergelt, H. Hinterberger, K. C. Sevcik: The Grid File: An Adaptable, Symmetric Multikey File Structure, ACM Transactions on Database Systems, Vol. 9, No. 1, pp. 38-71, 1984.

[Wang98]

W. Wang, J. Yang, R. R. Muntz: PK-Tree: A Spatial Index Structure for High Dimensional Point Data, Proc. 5th Intl. FODO Conf., 1998.

[Webe98]

R. Weber, H.-J. Schek, S. Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, Proc. 24th Int. Conf. on Very Large Data Bases (VLDB'98), New York City, pp. 194-205, Morgan Kaufmann, San Francisco, CA, 1998.

[Whit96]

D. A. White, R. Jain: Similarity Indexing: Algorithms and Performance, Proc. SPIE Vol. 2670, San Diego, USA, pp. 62-73, Jan. 1996.

[Widm97]

J. Nievergelt, P. Widmayer: Spatial Data Structures: Concepts and Design Choices, in Algorithmic Foundations of GIS, LNCS 1340, pp. 153-197, Springer-Verlag, 1997.

[Xiao97]

X. Fu, D. Wang, W. Zheng, M. Sheng: GPR-Tree: A Global Parallel Index Structure for Multiattribute Declustering on Cluster of Workstations, Proc. Advances in Parallel and Distributed Computing, pp. 300-306, Mar. 1997.

[Yang95]

Q. Yang, A. Vellaikal, S. Dao: MB+-Tree: A New Index Structure for Multimedia Databases, Proc. International Workshop on Multi-Media Database, pp. 151-158, 1995.

[Yian93]

P. N. Yianilos: Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces, Proc. Fourth ACM-SIAM Symposium on Discrete Algorithms, Jan. 1993.

APPENDIX A

Structure | Source | Based on | MBB | Over. | Bal. | Rect. | Comp. | Reinsert | Repr. | Data | Search Paths
R-Tree | [Gutt84] | B-Tree | Rect. | Yes | Yes | Yes | No | No | Exact | Dyn. | Multi
R+-Tree | [Sell87] | R-Tree | Rect. | No | Yes | Yes | No | No | Exact | Dyn. | Single
R*-Tree | [Krei90] | R-Tree | Rect. | Yes | Yes | Yes | No | Yes | Exact | Dyn. | Multi
SS-Tree | [Jain96] | R*-Tree | Sphere | Yes | Yes | No | No | Yes | Exact | Dyn. | Multi
SR-Tree | [Kata97] | R*-Tree, SS-Tree | Rect.-Sphere | Yes | Yes | No | No | Yes | Exact | Dyn. | Multi
UB-Tree | [Baye96] | B-Tree | None | No | Yes | No | Yes | No | Exact | Dyn. | Single
M-Tree | [Ciac97] | VP-Tree, R-Tree | None | Yes | Yes | No | Yes | No | Exact | Dyn. | Single
MB+-Tree | [Yang95] | B+-Tree | None | No | Yes | Yes | Yes | No | Exact | Dyn. | Single
TV-Tree | [Jaga94] | R-Tree | Sphere/Diamond | Yes | Yes | No | No | Yes | Apprx. | Dyn. | Multi
G-Tree | [Kuma94] | Grid File, B-Tree | None | No | Yes | Yes | No | No | Exact | Stat. | Single
X-Tree | [Krei96] | R*-Tree | Rect. | Yes | Yes | Yes | No | No | Exact | Dyn. | Multi
PK-Tree | [Wang98] | PR-quadtree, KD-Tree | None | No | Yes | Yes | Yes | No | Exact | — | Single
Buddy Tree | [Seeg90] | R-Tree, Grid File | Rect. | No | No | Yes | No | No | Exact | Stat. | Single
hB-Tree | [Lome90] | KDB-Tree | None | No | Yes | No | Yes | No | Exact | Dyn. | Single
Grid File | [Seci84] | — | None | No | — | Yes | Yes | No | Exact | Dyn. | Single
MVP-Tree | [Boik99] | VP-Tree | None | No | Yes | No | Yes | No | Exact | Dyn. | Single
VA-File | [Blot98] | Signature Method | None | No | — | No | Yes | No | Apprx. | Stat. | Single
A-Tree | [Saku00] | VA-File, SR-Tree | Rect.-Sphere | Yes | Yes | Yes | No | Yes | Apprx. | Dyn. | Multi

Structure: Name of the structure.
Source: Reference.
Based on: Underlying base structure(s).
MBB: Geometrical region of the page (bounding shape).
Over.: Whether partitions of the data space may overlap.
Bal.: Whether the structure is height-balanced.
Rect.: Whether the data space is partitioned into rectangles.
Comp.: Completeness, i.e. whether the entire data space is partitioned.
Reinsert: Whether the insertion algorithm uses the concept of forced reinsertion.
Repr.: Approximate or exact representation of the objects.
Data: Nature of the dataset (dynamic/static).
Search Paths: Number of paths traversed for an exact-match query.
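To make the legend's attributes concrete, the sketch below (not part of the surveyed structures; the `AccessMethod` record and its field names are illustrative assumptions) encodes a few table rows as Python records and queries them. It also shows the connection between two of the attributes: structures whose partitioning is overlap-free answer an exact-match query along a single root-to-leaf path.

```python
# Hypothetical sketch: encode a few comparison-table rows as records so the
# legend's attributes can be queried. Field names mirror the legend above.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class AccessMethod:
    name: str                   # Structure
    based_on: Tuple[str, ...]   # Based on
    overlap: bool               # Over.: partitions of the data space may overlap
    exact: bool                 # Repr.: exact (True) or approximate (False)
    search_paths: str           # Search Paths for an exact-match query

TABLE = [
    AccessMethod("R-Tree",  ("B-Tree",),           True,  True,  "multi"),
    AccessMethod("R+-Tree", ("R-Tree",),           False, True,  "single"),
    AccessMethod("R*-Tree", ("R-Tree",),           True,  True,  "multi"),
    AccessMethod("X-Tree",  ("R*-Tree",),          True,  True,  "multi"),
    AccessMethod("VA-File", ("Signature Method",), False, False, "single"),
]

def overlap_free(methods):
    """Structures with overlap-free partitioning: an exact-match query
    descends a single path instead of visiting multiple subtrees."""
    return [m.name for m in methods if not m.overlap]

print(overlap_free(TABLE))  # ['R+-Tree', 'VA-File']
```

Note how the two overlap-free entries are exactly the single-path entries, which is the trade-off the Over. and Search Paths columns capture.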
