pacman82/odbc-api

From 0.37 `buffers::buffer_from_description` is extremely slow

Closed this issue · 8 comments

Hello! Thanks for this amazing crate, such high quality and essential for accessing Snowflake using Rust.

When trying to upgrade from 0.36.1 to 0.39.0, I stumbled upon an issue, starting from 0.37.0, where buffer allocation is extremely slow. I'm wondering if I am using it wrong or if there is some underlying issue.

The use case is querying JSON data stored in TEXT columns (VARIANT in Snowflake terms), hence the larger max_str_len to cope with the possible extremes. Going by the (amazing) guide, the batch sizes shouldn't be anything out of the ordinary, except maybe the mentioned max_str_len.

I'm running on Ubuntu 20.04.

This is a minimal example that triggers it (on my machine and on the CI I am using):

use odbc_api::buffers::{BufferDescription, BufferKind};

const MAX_STR_LEN: usize = 65536;

const fn query_table_buffer_description() -> [BufferDescription; 1] {
    [BufferDescription {
        kind: BufferKind::Text {
            max_str_len: MAX_STR_LEN,
        },
        nullable: false,
    }]
}

fn main() {
    for i in 1..8 {
        let max_buffer_size = 10_usize.pow(i);
        println!("Buffer size: {max_buffer_size}");

        let now = std::time::Instant::now();

        let _buffer = odbc_api::buffers::buffer_from_description(
            max_buffer_size,
            query_table_buffer_description().into_iter(),
        )
        .expect("Error"); // Fallible since 0.37.0; for 0.36.1 remove this line.

        println!("Got my buffer in {}s", now.elapsed().as_secs_f64());
    }
}

For 0.36.1 the result looks as if we only allocate a pointer which then grows to fit the use case? Or is the same allocation happening behind the scenes?

Buffer size: 10
Got my buffer in 0.0000336s
Buffer size: 100
Got my buffer in 0.0000057s
Buffer size: 1000
Got my buffer in 0.0000051s
Buffer size: 10000
Got my buffer in 0.0000049s
Buffer size: 100000
Got my buffer in 0.0000048s
Buffer size: 1000000
Got my buffer in 0.0000087s
Buffer size: 10000000
Got my buffer in 0.0000049s

Starting from 0.37.0 it results in

Buffer size: 10
Got my buffer in 0.0112486s
Buffer size: 100
Got my buffer in 0.1132033s
Buffer size: 1000
Got my buffer in 1.1229766s
Buffer size: 10000
Got my buffer in 11.4219948s
Buffer size: 100000
Got my buffer in 115.9940551s
Buffer size: 1000000
^C  // Based on the increase we could assume ~1 000 seconds here.

Thanks again for the amazing crate!

Mathias

Hello @MathiasKindberg ,

thanks for the great description of the issue. This weekend is Mother's Day, so I won't make any promises about reproducing it soon, but based on the information you provided, it looks to me like the fallible allocation of text and binary columns introduced in 0.37.0 had a performance impact I hadn't been aware of. Maybe I need to run some benchmarks as part of my test suite, otherwise I'll have a hard time catching these issues myself. Thanks for your feedback, I rely on it.

What would you think about having buffers::buffer_from_description, which panics on out of memory but is (supposedly) faster, and a buffers::try_buffer_from_description, which fails recoverably on allocation errors (which was a desired feature for some downstream artefacts)?

This way users could choose between speed and error handling.
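Roughly, the choice on the caller's side would then look like this (just a sketch: try_buffer_from_description and its exact signature are only proposed at this point, nothing is final):

use odbc_api::buffers::{BufferDescription, BufferKind};

const MAX_STR_LEN: usize = 65_536;

fn description() -> BufferDescription {
    BufferDescription {
        kind: BufferKind::Text { max_str_len: MAX_STR_LEN },
        nullable: false,
    }
}

// Fast path: relies on the OS / global allocator and aborts the process on OOM.
fn allocate_infallible(batch_size: usize) {
    let _buffer =
        odbc_api::buffers::buffer_from_description(batch_size, std::iter::once(description()));
    // ... bind `_buffer` to a cursor as usual.
}

// Recoverable path: an allocation failure surfaces as an error instead of aborting.
fn allocate_fallible(batch_size: usize) {
    match odbc_api::buffers::try_buffer_from_description(batch_size, std::iter::once(description())) {
        Ok(_buffer) => { /* ... bind it to a cursor as usual. */ }
        Err(error) => eprintln!("could not allocate row set buffer: {error}"),
    }
}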

Regards, Markus

Hello @MathiasKindberg ,

Reintroduced infallible allocation in buffers::buffer_from_description. I could, however, not reproduce the differences in performance you encountered. I would very much like your feedback on whether the new version 0.40.0 helps with your performance troubles, or whether I misjudged the cause of the issue.

(screenshot attached)

Regards, Markus

Interesting. Maybe it's related to my machine being WSL2 (so a VM) and the CI being containers. It seems like the implementation from 0.37 actually claims the entire memory. Does it zero it? For reference, using 0.36 and the completely ridiculous buffer size of 1 000 000 it claims ~610 GB of virtual memory. But that isn't initialized until actually used, so it's fine in a sense, although maybe a tad brittle.
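As a plain-std illustration (a sketch, not taken from odbc-api's internals; it assumes a Linux-style allocator that hands out lazily committed zero pages), requesting a huge zeroed buffer returns almost instantly, while touching every page is where the real cost, and the OOM risk, shows up:

use std::time::Instant;

fn main() {
    // 2 GiB; scale this up to reproduce the OOM-killer behaviour described above.
    let len = 2_usize * 1024 * 1024 * 1024;

    let start = Instant::now();
    // With glibc on Linux this typically maps lazily committed zero pages,
    // so the call returns quickly even for very large sizes.
    let mut buf: Vec<u8> = vec![0u8; len];
    println!("allocated {} bytes in {:?}", buf.len(), start.elapsed());

    let start = Instant::now();
    // Write one byte per 4 KiB page to force the kernel to actually commit
    // the memory; only now does resident memory (and the OOM risk) grow.
    for i in (0..len).step_by(4096) {
        buf[i] = 1;
    }
    println!("touched every page in {:?}", start.elapsed());
}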

On the other hand, I don't think fallible allocation makes sense - unless you are the kernel. It ends up being a race against the OOM killer.

Running the code above on 0.39.0 with --release, it gets killed before the fallible allocation gets a chance to report an error. So I think the performance issue might also be related to running in debug mode, even though it is not super fast in release either.

Buffer size: 10
Got my buffer in 0.00025s
Buffer size: 100
Got my buffer in 0.0020781s
Buffer size: 1000
Got my buffer in 0.0136342s
Buffer size: 10000
Got my buffer in 0.1395752s
Buffer size: 100000
Got my buffer in 1.095602s
Buffer size: 1000000 
Killed

Hello @MathiasKindberg ,

thanks for the feedback. The constant time for the fast 0.36.x versions indicates that the operating system is likely "lying" about the memory being truly allocated. Probably the best thing to measure is the entire time to execute the query and fill the memory. That said, the infallible version is probably much better in situations where the user just specifies a ridiculously large batch size, because it does not truly allocate the memory up front.
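Something along these lines, just as a sketch (the connection string, query, and column layout are placeholders, and minor API details may differ slightly between versions):

use odbc_api::{
    buffers::{buffer_from_description, BufferDescription, BufferKind},
    Cursor, Environment,
};
use std::time::Instant;

fn main() -> Result<(), odbc_api::Error> {
    let environment = Environment::new()?;
    // Placeholder connection string.
    let connection = environment.connect_with_connection_string("DSN=Snowflake;")?;

    let start = Instant::now();
    // Placeholder query against a TEXT / VARIANT column.
    if let Some(cursor) = connection.execute("SELECT json_col FROM my_table", ())? {
        let description = BufferDescription {
            kind: BufferKind::Text { max_str_len: 65_536 },
            nullable: false,
        };
        // The allocation itself (cheap or expensive depending on the version) ...
        let buffer = buffer_from_description(10_000, std::iter::once(description));
        // ... plus binding and fetching, so lazily committed pages are actually
        // touched before the clock stops.
        let mut row_set_cursor = cursor.bind_buffer(buffer)?;
        let mut total_rows = 0;
        while let Some(batch) = row_set_cursor.fetch()? {
            total_rows += batch.num_rows();
        }
        println!("query + fetch of {total_rows} rows took {:?}", start.elapsed());
    }
    Ok(())
}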

To be clear: in 0.40.0 the infallible allocation is the default again. I wanted to verify that your performance problem is gone with the new version.

Cheers, Markus

@MathiasKindberg Is your performance problem gone? Can the issue be closed?

Closing this issue for now, but please reopen if it persists, @MathiasKindberg. Also still happy to get feedback on whether the issue has been solved.

@pacman82

Hey! Sorry, I forgot about this! Your fix solved the issue! It was also my fault for naïvely allocating too-large buffers and not being cautious about it, as you said, which exacerbated the issue to a much larger extent than it should have been.

My use case for the buffer was having a generic, easy function for tests that didn't have to deal with any iteration over buffers, and then, as you said, relying on the system allocator to keep it quick and lean.

Thanks for the amazingly quick turnaround time and amazing to see the announcement of async support on the rust subreddit!

One less mystery in the world. Thanks for following up!