sdd/kiddo

Too many items with the same position on one axis

Opened this issue · 15 comments

Hello, Scott.

I am using kiddo version="2.1.1". While running, I get this error:

panicked at 'Too many items with the same position on one axis. Bucket size must be increased to at least 1 more than the number of items with the same position on one axis.': /home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/kiddo-2.1.1/src/float/construction.rs:223

I create the KdTree like this:

use kiddo::KdTree;
let patterns: KdTree<f32, 3> = KdTree::with_capacity(15000);

Before the panic occurred, about 2500-3000 points like these had been added:

[0.375, 0.02111614, 0.6641945]
[0.4, 0.03517097, 0.7697125]
[0.8333333, 0.29152542, 0.79504514]
[0.39285713, 0.31456685, 0.68769246]
[0.4027778, 0.7054616, 0.17576419]
[0.4, 0.61720204, 0.12553063]
[0.7916667, 0.768097, 0.0121426955]
[0.6458333, 0.84298575, 0.010071943]
[0.35416666, 0.84498125, 0.009049486]
[0.3611111, 0.0, 0.032477904]
[0.75, 0.0, 0.0]
[0.6, 0.018008512, 0.15822956]
[0.20833333, 0.020134605, 0.19604518]
[0.6111111, 0.02478464, 0.21736293]
[0.22222222, 0.028708035, 0.21946773]
[0.6354167, 0.028441446, 0.26566488]
[0.32142857, 0.033044748, 0.24640274]
[0.6770833, 0.05056776, 0.2594764]
[0.35, 0.032584045, 0.23101737]
[0.37121212, 0.034161422, 0.10082746]
[0.70238096, 0.03304547, 0.10467288]
[0.34375, 0.01344555, 0.8041778]
[0.8333333, 0.004752005, 0.0]
[0.375, 0.28213167, 0.60720414]
[0.3888889, 0.0, 0.21019818]
[0.4, 0.0, 0.0]
[0.875, 0.13517289, 0.0]
[0.39583334, 0.14556533, 0.15978695]
[0.8055556, 0.12474227, 0.17170732]
[0.3690476, 0.080061756, 0.52134496]
[0.3690476, 0.117475726, 0.6248968]
...

What capacity should I choose if I expect to add no more than 15000 points? How do I calculate it correctly? And why can't the tree dynamically expand to the right size?

It's a bit discouraging; this behavior is not at all obvious. At first I used KdTree::new(), but very quickly got a similar error, and switching to creation via with_capacity also does not completely solve the problem.

In any case, thanks for your library. I use it in a home project to predict trade prices.

Sincerely, Valery

sdd commented

Hi Valery,

Sorry to hear you're having problems. The issue you're experiencing looks like it's happening because your dataset has lots of points with coordinate values of 0.0.

When you use Kiddo via the kiddo::KdTree import, you will get a k-d tree with the default bucket size of 32. kiddo::KdTree is a type alias, defined in lib.rs like this:

pub type KdTree<A, const K: usize> = float::kdtree::KdTree<A, usize, K, 32, u32>;

You can create a k-d tree with a larger bucket size, which should help. Something like this:

const BUCKET_SIZE: usize = 128;   // buckets 4 times larger than the default

use kiddo::float::kdtree::KdTree;
let patterns: KdTree<f32, usize, 3, BUCKET_SIZE, u32> = KdTree::with_capacity(15000);

Kiddo's k-d tree stores items in 'buckets' in the leaf nodes of the tree. The bucket size is fixed at compile time, but the number of leaves grows as you add items. When an item is added to a bucket that is already full, that bucket gets split and its contents get shared with a new leaf node.

When a split happens, a value is chosen for a new stem node that is used to sort the contents into one bucket or the other. For example, let's assume a simple 1-d tree with a bucket size of 6 and a leaf containing these values:

| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
+---+

If we then add 7 to the tree, that would overflow the bucket, so we split it. We sort the bucket and split it so that the first half of the items stay in our original bucket and the second half go to the new bucket:

|   |  |   |
|   |  |   |
|   |  |   |
| 1 |  | 4 |
| 2 |  | 5 |
| 3 |  | 6 |
+---+  +---+

We then pick the smallest value in the new bucket to use as our new pivot value, which we use to determine which bucket any added items go into. In this case, the pivot value will be 4.

Now we can add 7 to the tree. Since it is >=4, it will of course go in the second bucket.
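
To make this concrete, here is a minimal sketch of the naive midpoint split in Rust. It is illustrative only, not Kiddo's actual code:

// Illustrative sketch of the naive midpoint split described above;
// not Kiddo's actual implementation.
fn naive_split(bucket: &mut Vec<f32>) -> (Vec<f32>, f32) {
    bucket.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let new_bucket = bucket.split_off(bucket.len() / 2); // second half moves out
    let pivot = new_bucket[0]; // smallest value in the new bucket
    (new_bucket, pivot)
}

// For [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] this leaves [1.0, 2.0, 3.0] in the
// original bucket, moves [4.0, 5.0, 6.0] to the new one, and picks 4.0
// as the pivot.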

But problems arise if you have lots of items with the same value. Let's say we had a bucket that looked like this, with two 3s in it:

| 1 |
| 2 |
| 3 |
| 3 |
| 5 |
| 6 |
+---+

And we want to add an item, triggering a split. If we apply the same logic as before, we end up with buckets like this:

|   |  |   |
|   |  |   |
|   |  |   |
| 1 |  | 3 |
| 2 |  | 5 |
| 3 |  | 6 |
+---+  +---+

With the pivot value set to 3. But now we have a 3 in the wrong bucket. If you tried to find it by navigating from the pivot, you wouldn't get to it, because the pivot tells you that all the items >=3 are in the right bucket.

So, we need to change our split algorithm to check if the split point lands in between two items with the same value. If this happens, we move the split point to the left until we no longer land in between two items with the same value. Our buckets end up split like this:

|   |  |   |
|   |  |   |
|   |  | 3 |
|   |  | 3 |
| 1 |  | 5 |
| 2 |  | 6 |
+---+  +---+

This is not quite optimal, but things don't break. But, if we get too many items with the same value in one bucket, and we try to add another, we end up with a problem:

| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
+---+

How can we split this bucket in half? We can't. There's no value that we can choose to pivot on to end up with some of the items in another bucket.

This is what is happening with your dataset. You have a lot of zeroes in your data. Kiddo has a fixed bucket size. If a bucket gets full of items with the same value on one dimension, and we try to add another item that ends up in that bucket, then we can't split the bucket, and so Kiddo has to panic.
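
To tie the diagrams together, here is a sketch of the adjusted split-point selection, including the unsplittable case that triggers the panic. Again, this is illustrative only, not Kiddo's actual code:

// Illustrative sketch: shift the split point left past runs of equal
// values. None means every item shares the same value and the bucket
// cannot be split, which is the case where Kiddo currently panics.
fn split_point(sorted: &[f32]) -> Option<usize> {
    let mut mid = sorted.len() / 2;
    while mid > 0 && sorted[mid] == sorted[mid - 1] {
        mid -= 1; // would split between two equal values; move left
    }
    if mid == 0 { None } else { Some(mid) }
}

// split_point(&[1.0, 2.0, 3.0, 3.0, 5.0, 6.0]) == Some(2): the left
// bucket gets [1, 2], the right gets [3, 3, 5, 6] with pivot 3.
// split_point(&[0.0; 6]) == None: a bucket full of zeroes can't split.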

Your only option is to increase the bucket size enough that you don't end up with a bucket full of items that share the same value on one axis. Larger buckets can lead to worse performance, but that may not be noticeable, depending on your use case; it probably won't be for a tree with just 15k items.

Hi, Scott.

Thank you for such a detailed and thorough answer. Apparently the problem is in my data; I need to pre-process it somehow. I need to think about it.

This is a great explanation; maybe it's worth adding it to the library's documentation?

Hi, Scott.

It seems that a panic is still not the best solution. I can't guarantee the invariant of unique values on one axis (or that values won't repeat too frequently); it's possible in theory, but tricky. I have a lot of real-time data, and it's not even clear how the business logic would need to be adjusted to painlessly discard part of it. Might it be worth having pub fn add(&mut self, query: &[A; K], item: T) return a Result?

P.S. The option with BUCKET_SIZE = 128 helps, but not always; it does not guarantee that the library works correctly with my data. Sooner or later, the panic happens anyway.

UPD. Sorry, maybe I'm wrong: there is an error in my calculations. I'm generating a lot of zero values, i.e. my data is not correct.

Hi, Scott.

My calculation logic can legitimately produce a series of zero values. It was not a mistake, just a particular market situation (low volatility). I added filtering for such values, i.e. if zero values are encountered, they are not added to the model at all. It seems to help, but I'm still not sure: I cannot predict which values will be calculated or whether there will be repeating values among them. My suggestion to make the add function return a Result still stands. Panicking seems too hardcore :)
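
For reference, the kind of pre-filtering described above could look roughly like this. This is a sketch: points is assumed to be a Vec<[f32; 3]> from the surrounding application, and the zero test should be adapted to whatever counts as degenerate in your own data.

use kiddo::KdTree;

// Sketch: skip points containing a zero coordinate instead of adding
// them, as described above.
let mut patterns: KdTree<f32, 3> = KdTree::with_capacity(15000);
for (idx, point) in points.iter().enumerate() {
    if point.iter().any(|&v| v == 0.0) {
        continue; // drop the point rather than risk an unsplittable bucket
    }
    patterns.add(point, idx);
}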

Sorry for my endless posts, but this is real-world experience with the library.

sdd commented

Hi @vchugreev,

No need to apologise, I'm always happy to hear from people using Kiddo. Sorry to hear that it wasn't useful for your use case! Yes, I'll consider switching the add method to return a Result. I used to have it that way in v0.x. It is certainly more ergonomic than a panic, I agree.

Hi, @sdd

The library is definitely useful: I've used it before, I'm using it right now, and I plan to use it in the future. I'm just afraid to use it in a production solution. If money is at stake (in my case it is; I have a fintech project), then it's just risky.

In any case, thanks for your work. You are the author, and you decide how to develop it further. I hope that over time the add function will become safe.

Just chiming in to add my experience: I'm having this issue too. My data is roughly 13k points in 3D space taken with a laser scanner. The points include a precision-ground flat surface that happens to be aligned so that a bunch of y values end up clustered around each other, although none of the y values are identical. There is at least 2e-7 between each y value, and only about 900 of the 13k are less than 1e-4 from their sorted neighbor.

This is what the distribution of y values looks like in my data, which appears to be what's causing the error.

[image: distribution of y values in the dataset]

I'll try experimenting with the bucket size, but I'm working on a library myself, and I'm not sure I'll be able to control what the data it ends up being used on looks like. Based on how your underlying implementation works, are there bucket sizes I can pick at compile time that would work for data with millions or hundreds of millions of points shaped like this, or do you really have to know a lot about the size and shape of your data beforehand to use this library?

Thanks!

sdd commented

Hi Matt,

I've got an idea for how this could be addressed. Essentially it would involve having two sets of leaf nodes. The first type would be the fixed-length ones that we have now, which provide better performance (their size is fixed at compile time, so a list of them doesn't require an extra indirection through a Vec). The second type would be dynamically sized buckets that would be slightly slower to access due to their contents being a Vec rather than an array.

The stem nodes would use the most significant bit of the leaf index to indicate the leaf index type. This would be quick to check, and the stem nodes would not need to get any bigger in memory, keeping more of them in the CPU cache to help with performance. This would mean that the max quantity of standard leaf nodes is halved, but since the type of this index is configurable to be a u16, u32 or u64, that's not necessarily a problem. With u16s and a bucket size of 32, that gives us a capacity of 2^15 * 2^5 = 2^20 (~1M) items. Switching to u32 with bucket size 32 gives us room for 2^31 * 2^5 = 2^36 (nearly 70 billion) items. I think that's more than enough for anyone 😆
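
As a sketch of how that most-significant-bit tag might work on a u32 leaf index (hypothetical, not existing Kiddo code):

// Hypothetical sketch: use the top bit of a u32 leaf index to mark the
// second, dynamically sized leaf type described above.
const DYNAMIC_LEAF_FLAG: u32 = 1 << 31;

fn is_dynamic_leaf(index: u32) -> bool {
    index & DYNAMIC_LEAF_FLAG != 0
}

fn fixed_leaf_index(index: u32) -> u32 {
    index & !DYNAMIC_LEAF_FLAG // strip the flag to recover the real index
}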

This would mean that the library would work for any scenario, with performance gradually degrading the further the dataset is from an ideally suited distribution.

sdd commented

I'm already working on two more alternate types of KdTree that will be available within the library, optimised for certain scenarios. I could keep the existing highest-performance tree and add another type better suited to your use case.

At the same time I intend to introduce a KdTree trait that each of these variants implements, to make it easier for anyone who wants to experiment by switching between trees with differing underlying implementations.

Currently, the splitting dimension is based on the depth of each stem node:

split_dim = (split_dim + 1).rem(K);

Unfortunately, as discussed above, this approach does not work for many datasets.

I suggest changing the implementation so that the split dimension is incremented by one in a loop until the data can be split. This would slightly increase the size of each stem node, as it would have to store the index of the splitting dimension. IMHO the overall performance impact should be minimal.
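
A rough sketch of that loop (hypothetical; the duplicate check here is a stand-in for whatever the real implementation would use):

// Hypothetical sketch of the proposed loop: advance the split dimension
// until one is found on which the bucket's values are not all equal.
fn next_split_dim<const K: usize>(points: &[[f32; K]], mut split_dim: usize) -> Option<usize> {
    for _ in 0..K {
        split_dim = (split_dim + 1) % K;
        if points.iter().any(|p| p[split_dim] != points[0][split_dim]) {
            return Some(split_dim); // this axis can separate the items
        }
    }
    None // all points identical on every axis; no dimension can split
}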

> I suggest changing the implementation so that the split dimension is incremented by one in a loop until the data can be split. This would slightly increase the size of each stem node, as it would have to store the index of the splitting dimension. IMHO the overall performance impact should be minimal.

This solution still does not account for some corner cases, for example having the points placed on an axis-aligned cross.

sdd commented

This has the downside of increasing the size of each stem node from what is currently just a single u16 or u32 to probably also contain a u8 as well. A lot of effort was put into ensuring that these stem nodes are as small as possible, in order to fit as many of the interior nodes into the CPU cache as possible, to ensure the best performance, and to stay in a higher-performing regime with larger numbers of items stored.

I would expect this to result in a performance hit in a proportion of use cases. Memory alignment could affect how much of an effect this has in reality.

Another option could be to steal more bits from the top end of the stem node values. If u32s were being used for the index, it would be possible to get away with taking 2 or 3 bits of that u32 (for up to 4D or 8D datasets respectively) to store the split dimension; Kiddo already uses 1 bit to indicate that the child is a leaf node. With the standard bucket size of 32, this would still let you store 2^28 * 32 or 2^29 * 32 items, i.e. roughly 8 or 16 billion, likely way more than anyone is practically using anyway. This could cause some users to have to shift from a u16 index to a u32 index, though. Doing this with a u16 index would mean that you'd need to switch to u32 once you reached 2^13 * 32 (~262k) items for 3 or 4 dimensions, or 2^12 * 32 (~131k) items for 5 to 8 dimensions, rather than the current ~1M items. So anyone storing more than ~131k but fewer than ~1M items would see a potential performance hit (in practice this may not be that large), plus an increase in the size of any trees serialized to disk and held in memory (again, not a huge increase, as most of the space is taken up by the leaves if you use the standard bucket size).
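
As an illustration only (this is not Kiddo's current layout), a u32 index carrying 1 leaf-flag bit and 2 split-dimension bits for up to 4D data might be packed like this:

// Hypothetical sketch of the bit-stealing layout described above:
// bit 31 = leaf flag, bits 29-30 = split dimension, bits 0-28 = index.
const LEAF_BIT: u32 = 1 << 31;
const DIM_SHIFT: u32 = 29;
const INDEX_MASK: u32 = (1 << 29) - 1;

fn pack(is_leaf: bool, split_dim: u32, index: u32) -> u32 {
    (if is_leaf { LEAF_BIT } else { 0 }) | (split_dim << DIM_SHIFT) | (index & INDEX_MASK)
}

fn unpack(packed: u32) -> (bool, u32, u32) {
    (packed & LEAF_BIT != 0, (packed >> DIM_SHIFT) & 0b11, packed & INDEX_MASK)
}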

Another possibility is to ignore the fact that the tree may become unbalanced (with the maximum difference in depth being Dx, where D is the dimension) and divide a full leaf multiple times until none of its children are over capacity. The only requirement would be that not too many items lie at exactly the same coordinates.

I would like to chime in with my experience. I really like this library and I appreciate the work that has gone into it. But this issue is unfortunately a dealbreaker: the fact that, given the wrong set of data, the program can outright panic makes it unsuitable for use in any public library or in production.

In my case, which is computer graphics photon mapping, having thousands of photons on a flat plane (with a coordinate value in common) is very typical, which unfortunately completely rules out this library. It's a shame, because the performance and ergonomics are otherwise best in class.

Maybe another KD-tree type that heap-allocates instead of stack-allocating could solve this use case (or one that only heap-allocates on bucket overflow, with something like smallvec). Or a method of splitting the bucket at an arbitrary point, recording that the split is arbitrary so that both buckets get checked on access. Or manual assignment of an extra value to equal coordinates, to impose an order on data points with otherwise equal values.
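
The overflow idea might be sketched with the smallvec crate, which stores up to N items inline and spills to the heap past that. This is a sketch of the concept, not a proposed Kiddo API:

use smallvec::SmallVec;

// Sketch: a bucket that keeps up to 32 items inline and transparently
// spills to the heap on overflow, so an overfull bucket degrades in
// performance instead of panicking.
type Bucket<T> = SmallVec<[T; 32]>;

fn add_to_bucket<T>(bucket: &mut Bucket<T>, item: T) {
    bucket.push(item); // never fails; heap-allocates only past 32 items
}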

I understand that these all involve a trade-off of either memory or performance, but in my view, squeezing 5% more performance for a tree that may panic when looked at the wrong way isn't a worthwhile compromise.

acshi commented

As an intermediate solution, until other tree types with different performance trade-offs arrive, just adding a fallible try_add function would be more than enough for my use cases.
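
For illustration, such a fallible API might look something like this. This is hypothetical: neither this trait nor AddError exists in Kiddo at this point in the thread, and the generics simply mirror the add() signature quoted earlier.

// Hypothetical sketch of a fallible insertion API, as requested above.
pub enum AddError {
    BucketFull,
}

pub trait TryAdd<A, T, const K: usize> {
    // Same semantics as add(), but returns Err(AddError::BucketFull)
    // instead of panicking when a full bucket cannot be split.
    fn try_add(&mut self, query: &[A; K], item: T) -> Result<(), AddError>;
}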