gnzlbg/slice_deque

I hit panic!("number of iterations exceeded!") sometimes

Closed this issue · 18 comments

I have an app that uses about 5 slice-deque buffers, each used from a separate thread and created at roughly the same time.

Sometimes I hit this panic:

panic!("number of iterations exceeded!");

roughly once in 10-20 startups of the app. Other apps are starting at the same time, so system resources are quite clogged. It's an Intel Celeron N2930 running Windows 10, so nothing really powerful.

Does this seem like a reason to increase MAX_NO_ALLOC_ITERS? Or is there something suspicious going on? Would it be a bad idea to increase MAX_NO_ALLOC_ITERS a lot, e.g. 1000x? I have no idea how long it takes to allocate an mmapped area...

Thanks for the report @dvtomas I will look into it!

PR #56 should help here, but I don't know if this will be a definitive solution. On SysV we try 10 times, so I've updated the Windows platform code to do the same. In any case, this shouldn't panic here, but just return an error instead - it might well be that you still get a panic, but that should happen somewhere else.

OK, thanks, hope the PR will help. This kind of nondeterministic behavior can be quite unfortunate when it only happens once in a while...

All other major OSes (macos, linux, *bsds, ios, android, ...) have APIs to solve this problem without race conditions. AFAICT it is impossible to solve this problem on Windows with the currently available Windows APIs. So I'm pretty sure you could open a bug in their bug tracker about this.

Until then, there is little I can do. I wanted to expose the experimental try_push and try_reserve API, which allows you to handle failure here yourself, or just loop forever. You might also be able to improve this by just increasing the priority of your process. But if your system is under heavy load from competing processes, you can get very unlucky while trying to grow a SliceDeque, so you probably want to avoid this as much as possible by conservatively reserving more memory upfront.
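
For example, something along these lines (a minimal sketch, assuming the Vec-like with_capacity constructor that SliceDeque mirrors):

use slice_deque::SliceDeque;

fn main() {
    // Reserving generously at startup means the deque should rarely need to
    // grow (and thus rarely hit the Windows remapping race) once the system
    // is under heavy load.
    let mut samples: SliceDeque<f32> = SliceDeque::with_capacity(1 << 20);
    assert!(samples.capacity() >= 1 << 20);
    samples.push_back(0.0);
}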

Thank you for the effort, explanation and heads up. If there was a simple way to make this configurable at compile time, perhaps through Cargo.toml features, I would set it to a larger value. But even with MAX_NO_ALLOC_ITERS=10 I think our application will work alright, we're not making space rockets.

The thing is, the Vec and VecDeque APIs that SliceDeque exposes do not propagate errors. That is, if allocation / reallocation fails (e.g. because the number of iterations is exceeded), then you get a panic! at some point.

For the max number of iterations on platforms that cannot implement SliceDeque without a race condition, there is no right number. Under high load, you might need to loop forever.

I will try to expose a try_push and try_reserve API, and you will be able to use those to loop for as long as you see fit, make that configurable at compile-time, run-time, loop forever if you want to, etc.

So the PR now also adds try_reserve and try_push_front/try_push_back APIs. These APIs (probably) never fail on systems with memory overcommit, but they can fail on systems without it, like Windows. You can use them to keep attempting to allocate or add a value for as long as you want.
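
For illustration, a caller-side retry loop over that API could look roughly like this (only a sketch; the try_reserve signature and error type are assumptions about what the PR ends up exposing):

use std::thread::sleep;
use std::time::Duration;

use slice_deque::SliceDeque;

// Keep retrying the fallible reservation for as long as the caller sees fit,
// instead of letting a plain reserve()/push_back() panic on failure.
fn reserve_with_retries<T>(buf: &mut SliceDeque<T>, additional: usize, max_retries: usize) -> bool {
    for _ in 0..=max_retries {
        if buf.try_reserve(additional).is_ok() {
            return true;
        }
        // Probably lost the Windows remapping race (or memory is tight);
        // back off briefly and try again.
        sleep(Duration::from_millis(10));
    }
    false
}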

OK, the API looks quite nice, thank you. If I understand correctly, try_push_* will fail on Windows when MAX_NO_ALLOC_ITERS is exceeded, and there's a good chance that if I retry (perhaps a couple of times), I'll succeed eventually, is that correct? Are there also other ways try_push can fail that don't have such a good chance of recovery when retried, like when RAM is actually exhausted? If so, would it perhaps be worth signalling the cause to the user by having several variants of Err(cause) instead of just a unit error value?

OK, the API looks quite nice, thank you. If I understand correctly, try_push_* will fail on Windows when MAX_NO_ALLOC_ITERS is exceeded, and there's a good chance that if I retry (perhaps a couple of times), I'll succeed eventually, is that correct?

There is a chance that if you retry it will work, but I wouldn't say the chance is good if it has already happened 5 or 10 times.

Are there also other ways try_push can fail that don't have such a good chance of recovery when retried, like when RAM is actually exhausted?

Yes, it can also fail if you exhaust physical memory, virtual memory, etc. I wonder which of the steps in the algorithm is actually failing in your case.

If so, would it perhaps be worth signalling the cause to the user by having several variants of Err(cause) instead of just a unit error value?

I don't think this is worth it. Chances are that if you file a Windows bug they will either fix it, or tell you how to implement this without a race condition so that you can submit a PR to this library with a race-free implementation. Even if the Windows team decides to never fix this flaw in their API, requiring all users to handle a second error condition that cannot technically happen on most operating systems wouldn't be reasonable IMO.

Yes, it can also fail if you exhaust physical memory, virtual memory, etc. I wonder which of the steps in the algorithm is actually failing in your case.

From a quick look at the algorithm, it's one of the two map_view_of_file calls, but that's probably nothing new to you.
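
For context, the shape of that remapping loop is roughly the following (a hypothetical sketch with made-up helper names, not the crate's actual code): reserve an address range big enough for two adjacent views, release it, then try to map the same file mapping twice at those addresses, racing against everything else in the process.

use std::io;

const MAX_NO_ALLOC_ITERS: usize = 10; // the bound discussed above

// Placeholder stubs standing in for the real winapi-based helpers; names and
// signatures here are invented purely for illustration.
fn reserve_address_range(_len: usize) -> io::Result<usize> { Ok(0x1000_0000) }
fn release_address_range(_addr: usize) {}
fn map_view_of_file(_addr: usize, _len: usize) -> io::Result<()> { Ok(()) }
fn unmap_view_of_file(_addr: usize) {}

fn alloc_mirrored(len: usize) -> io::Result<usize> {
    for _ in 0..MAX_NO_ALLOC_ITERS {
        // 1. Find an address range big enough for two adjacent views.
        let addr = reserve_address_range(2 * len)?;
        // 2. Release it so the views can be mapped there. Between this release
        //    and the mappings below, any other thread or process can grab the
        //    range: that is the race that the available Windows APIs do not
        //    let us avoid.
        release_address_range(addr);
        // 3. Map the same file mapping twice, back to back.
        if map_view_of_file(addr, len).is_ok() {
            if map_view_of_file(addr + len, len).is_ok() {
                return Ok(addr); // both views landed where we wanted them
            }
            unmap_view_of_file(addr); // lost the race on the second view
        }
        // Lost the race; start over with a fresh address range.
    }
    Err(io::Error::new(io::ErrorKind::Other, "number of iterations exceeded"))
}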

If so, would it perhaps be worth signalling the cause to the user by having several variants of Err(cause) instead of just a unit error value?

I don't think this is worth it. Chances are that if you file a Windows bug they will either fix it, or tell you how to implement this without a race condition so that you can submit a PR to this library with a race-free implementation. Even if the Windows team decides to never fix this flaw in their API, requiring all users to handle a second error condition that cannot technically happen on most operating systems wouldn't be reasonable IMO.

I don't think I'll be filing an issue to the Windows team. I work on Linux all the time and just cross-compile our app releases to Windows; I don't even know if Windows has some sort of issue tracker :-(

As for more error variants, I imagine something like this, but with less horrible naming

enum AllocError {
  // Transient failure (e.g. the Windows remapping race lost too many times);
  // retrying has a reasonable chance of succeeding.
  Soft,
  // Memory is actually exhausted; retrying is unlikely to help.
  OOM,
}

fn try_push_(..) -> Result<(), AllocError> ...

That would at least communicate the nature of problems that can happen within the try_push call, and I think that would have to be documented anyway. Users who don't want to handle the variants separately can just pattern-match on Err(_), without any impact on ergonomics, but they still have the option to handle the variants differently if they want to.
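
For example (a self-contained sketch that repeats the made-up names from above so it stands on its own):

enum AllocError {
    Soft, // transient, e.g. the Windows remapping race; worth retrying
    OOM,  // memory really is exhausted; retrying won't help
}

// A caller that cares can react differently per variant...
fn handle(result: Result<(), AllocError>) -> bool {
    match result {
        Ok(()) => true,
        Err(AllocError::Soft) => false, // schedule a retry later
        Err(AllocError::OOM) => false,  // give up, report out-of-memory
    }
}

// ...while a caller that doesn't can keep matching on Err(_) as before:
fn handle_lazily(result: Result<(), AllocError>) -> bool {
    match result {
        Ok(()) => true,
        Err(_) => false, // ignore the cause entirely
    }
}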

I like slice-deque a lot and I don't like Windows, but it's a platform our customers use, so I can't just disregard it with "can't happen on other OSes". But I don't have a strong opinion on this. Personally, you've explained to me very well what exactly is going on, so I'll be fine with Result<(), ()> anyway. It's just that I think others should be informed that there are potential problems on Windows, and I think the finer-grained error communicates that better.

Users who don't want to handle the variants separately can just pattern-match on Err(_), without any impact on ergonomics, but they still have the option to handle the variants differently if they want to.

This is a good point, sorry that I'm a bit grumpy today.

So I've updated the PR; I called the non-OOM error variant "Other". There isn't really a good name for it, since too much stuff can fail while allocating memory, and OOM is just a tiny part of it.

It looks nice, thank you!

Hey,
do you plan to release a new version with the PR merged anytime soon?

Oops, I forgot!

Published v0.1.15, let me know if it works for you!

It works perfectly so far, thank you!

Awesome, let me know if you find any more issues!