sdd/kiddo

Issue with rkyv, and perhaps a suggestion

Closed this issue · 17 comments

Issue

This may be a me problem or an rkyv problem, but I wanted to bring it up and discuss it here. The code below does not work when DATA is 512 * 512 * 512, but works with the other two sizes. May be unrelated but number of bytes in a 512^3 f32 tree are suspiciously close to (and over) the u32 limit.

use std::{
    error::Error,
    fs::File,
    io::{Read, Write},
};

use kiddo::{float::distance::squared_euclidean, KdTree};
use num_format::{Locale, ToFormattedString};
use rayon::prelude::{IntoParallelIterator, ParallelIterator};
use rkyv::{
    ser::{
        serializers::{AlignedSerializer, BufferScratch, CompositeSerializer},
        Serializer,
    },
    AlignedVec, Infallible,
};

type FloatType = f32;
fn main() -> Result<(), Box<dyn Error>> {
    // const DATA: usize = 100_000;
    const DATA: usize = 256 * 256 * 256;
    // const DATA: usize = 512 * 512 * 512;
    let data: Vec<[FloatType; 3]> = (0..DATA).map(|_| rand::random()).collect();

    let timer = std::time::Instant::now();
    let mut tree = KdTree::new();
    for (i, p) in data.iter().enumerate() {
        tree.add(p, i);
    }
    println!(
        "it takes {} micros to build tree",
        timer.elapsed().as_micros().to_formatted_string(&Locale::en)
    );

    let timer = std::time::Instant::now();
    {
        let mut file = File::create("ser_tree")?;
        serialize_to_rkyv(&mut file, tree.clone());
    };
    println!(
        "it takes {} micros to serialize and save tree",
        timer.elapsed().as_micros().to_formatted_string(&Locale::en)
    );

    let timer = std::time::Instant::now();
    let mut file = File::open("ser_tree")?;
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)?;
    println!("loaded buffer");
    let archived_tree = unsafe { rkyv::archived_root::<KdTree<FloatType, 3>>(&buffer) };
    println!("deserd archived");
    let tree_desered: KdTree<FloatType, 3> = archived_tree.deserialize(&mut Infallible)?;
    let same_tree = tree_desered == tree;
    assert!(same_tree, "invalid tree deser");
    println!(
        "it takes {} micros to load and deserialize tree",
        timer.elapsed().as_micros().to_formatted_string(&Locale::en)
    );

    let queries: Vec<[FloatType; 3]> = (0..100).map(|_| rand::random()).collect();
    let timer = std::time::Instant::now();
    let result: Vec<_> = queries
        .into_par_iter()
        .map(|q| tree_desered.nearest_one(&q, &squared_euclidean))
        .collect();
    println!(
        "it takes {} micros to query the tree {} times",
        timer.elapsed().as_micros().to_formatted_string(&Locale::en),
        result.len()
    );

    Ok(())
}

fn serialize_to_rkyv(file: &mut File, tree: KdTree<FloatType, 3>)  {
    let mut serialize_buffer = AlignedVec::with_capacity(tree.size());
    let mut serialize_scratch = AlignedVec::with_capacity(tree.size());
    unsafe {
        serialize_scratch.set_len(tree.size());
    }
    serialize_buffer.clear();
    let mut serializer = CompositeSerializer::new(
        AlignedSerializer::new(&mut serialize_buffer),
        BufferScratch::new(&mut serialize_scratch),
        Infallible,
    );
    serializer
        .serialize_value(&tree)
        .expect("Could not serialize with rkyv");
    let buf = serializer.into_serializer().into_inner();
    file.write(buf.as_ref())
        .expect("Could not write serialized rkyv to file");
}

I actually get non-deterministic errors. Sometimes its a segfault, sometimes it's a LayoutError. I think the most likely explanation is that I am not doing the byte alignment correctly.

Suggestion

Provide a method for rkyv and serde ser/de of trees that simply takes in an impl AsRef<Path> with some reasonable serializer/deserializer parameters that takes care of all alignment issues.

sdd commented

In this case, instead of using the kiddo::KdTree export, try using float::kdtree::KdTree<A, usize, K, 32, u64>?

This will switch the internal indexing type inside the tree to a 64-bit index rather than 32. The vanilla kiddo::KdTree export is equivalent to float::kdtree::KdTree<A, usize, K, 32, u32> which only uses a 32 bit index internally.

I had tried this yesterday but I got the following error:

the function or associated item `new` exists for struct `KdTree<f32, usize, 3, 32, u64>`, but its trait bounds were not satisfied
the following trait bounds were not satisfied:
`<u64 as kiddo::types::Index>::T = u64`
`u64: kiddo::types::Index`

i.e., it appears that Index isn't implemented for u64, only for u32 and u16

Also, 512^3 is only 2^27

sdd commented

Aah, I see, I thought I'd implemented Index for u64 but it seems not. Even at 2^27, it's about an order of magnitude more items than I've tried to store in it before - I'll create some test cases that use that kind of quantity and fire up the debugger to see what's going on.

sdd commented

I tried again with a random dataset (3D f32 with u64 content). 72.5M items worked successfully but 73M causes a LayoutError consistently.

73M continued to cause a segfault even if I doubled the bucket size, and / or set IDX to u64 after implementing Index for u64, or increased the BUFFER_LEN and SCRATCH_LEN by 10x to 3Gb. This seems to indicate to me that the issue is with the number of points themselves rather than the number of stems or leaves, since the quantity of those
gets reduced below whatever threshold is breaking this by doubling the bucket size (to 1639091 stems / 1639092 leaves on one attempt).

The KdTree itself still seemed to work at 73M, as did the serialization itself (or at least, it didn't crash).

I tried looking at the size of the serialized tree on disk. The 72.5M-item-tree was 2,139,449,440 bytes and the 73M-item-tree was 2,147,479,552 bytes.

2Gb is 2^31 bytes, which is 2,147,483,648. So, going from 72.5M to 73M almost crossed 2^31 bytes. Suspicious!

I think this suggests Rkyv is the culprit. Perhaps the relative pointers that it uses are 32 bit signed, and at the point we reach a 2Gb file, it overfows?

sdd commented

No matter how much more items I add to the tree, over a certain point the serialized tree is always exactly 2,147,479,552 bytes. 100M items and 73M items both produce files with 2,147,479,552 bytes. Perhaps there is an internal buffer that is indexed by an i32 somewhere and the counter just wraps around when it goes past the end or something.

Interesting, maybe check in with the rkyv folk?

sdd commented

Aha, this looks relevant: rkyv/rkyv#209 (comment)

Looks like we need to enable the size_64 feature. (https://docs.rs/rkyv/latest/rkyv/#features)

Edit: I thought that was on already? https://github.com/sdd/kiddo/blob/master/Cargo.toml#L63

yea I had it on too in my example... maybe a bug?

I think I got it... we are using file.write not file.write_all, lol. The buffer returned by rkyv is bigger than ≈2^32.

    pub fn write(&self, buf: &[u8]) -> io::Result<usize> {
        let ret = cvt(unsafe {
            libc::write(
                self.as_raw_fd(),
                buf.as_ptr() as *const libc::c_void,
                cmp::min(buf.len(), READ_LIMIT), // <--- right here
            )
        })?;
        Ok(ret as usize)
    }

With this... one run gives

it takes 46,720 millis to build tree
it takes 1,892 millis to serialize and save tree
it takes 519 millis to load file into mem
it takes 750 nanos to archived_root

46s to build tree from scratch vs ≈519 millis to load and zc deser... ≈100x speedup for my use case lol

sdd commented

Awesome, great spot :-) I'll update the example.

Have you tried using mmap too?

sdd commented

This is all you need to do once you've installed memmap:

let mmap = unsafe { MmapOptions::new().map(&File::open("./examples/large-random-tree.rkyv")?)? };
let tree = unsafe { rkyv::archived_root::<Tree>(&mmap) };

Thats awesome. I haven't used mmaping that much, only once or twice for a npy file. How does mmaping affect query performance?

In any case, going back to the point of this issue: I think a load and save method with reasonable defaults would make the library very ergonomic for newcomers.

sdd commented

The memmap doesn't affect the query performance adversely at all. I just tried an experiment where I created a random KdTree with 250,000,000 * ([f32; 3] + u64) and serialized it to an rkyv file, which was 8Gb in size. Then a separate binary that memmaps it in and performs a query. I was getting times of ~80 to 100us to map in the file and perform the query. I should test the performance after a reboot, to see what happens when the file is definitely not in the OS's file cache.

You're completely correct re the load and save methods. ArchiveKdTree::load() and KdTree::save() would also help point people towards using rkyv correctly (unlike my original usage 😅 ). I'll get those added.

sdd commented

I'l close this one too, since we got to the bottom of the problem. Thanks for the input with this one as well.