/M-Tree

A data structure for efficient nearest-neighbor queries.

Primary LanguagePythonMIT LicenseMIT

An m-tree is a data structure which indexes objects according to their relative
distances. It is efficient for nearest-neighbor queries.

This implementation follows the content of the article
http://www.vldb.org/conf/1997/P426.PDF, with the following highlights:
 * The data structure is the same described in the article.
 * The algorithm for adding data had adaptations to handle with some corner cases.
 * The algorithm for removing data was designed from scratch, since it was not
   described on the article.
 * Both query algorithms (by-range and k-nearest-neighbor) were merged into one.

The query algorithm can have as criteria either the range or the maximum number
of resulting items, or both. The algorithm processes the results as needed, as
the resulting items are fetched. The results are returned in non-decreasing distance
from the query data parameter.

MTree is the core class that implements an m-tree. The data objects can be
any object that are understood by the distance function.
Some examples:
 * Data objects could be N-dimensional-space coordinates and the distance function
   could return the euclidean distance between two data objects.
 * Data objects could be regular strings and the distance function could calculate
   the edit (Levenshtein) distance between two strings.
The distance function must be provided by the user.
Two equal objects should not be added to the same MTree instance. There is no
validation regarding this and, if done, the behavior of the tree is undefined.

Given a data object type and a distance function, some aspects can directly
impact the performance of an m-tree: the constraints on the number of children for
the nodes and the support functions (split, promotion, and partition functions).
An m-tree is implemented as a tree (really!? :-P) and the minimum and maximum
number of children on each node can be customized. When the maximum capacity of
a node is exceeded, a split function is used to split the node, which is replaced
by two new nodes. The split function can be seen as being composed by a promotion
function and a partition function. The promotion function must choose (promote)
two children from the node that will be split. The partition function must partition
the set of children into two sets corresponding to the promoted children. The two
promoted children and their corresponding partitions will be used to create the
two nodes that will replace the exceeding node.
There are some pre-defined implementations for promotion and partition functions,
and the user can implement new ones at will.


The yet-to-be-implemented MTreeDict class wraps an MTree and a mapping structure,
so it helps on keeping track of objects already added. The data objects work
as keys for associated value objects, so MTreeDict can be used as a dictionary.


The yet-to-be-implemented MTreeMultiDict works like the MTreeDict, but allows
associating more than one value object to the same key. The number of values
associated to the same key are taken into account when performing queries
constrained by the number of results.



See the README file specific for each language:
 * README-py for Python.
 * README-cpp for C++.
	

For more information see:
	http://en.wikipedia.org/wiki/M-tree
	http://www.vldb.org/conf/1997/P426.PDF