mkeeter/fidget

Vector types for tree construction.

LukeP0WERS opened this issue · 14 comments

Yesterday I put together a crate which generates vector types for Tree based off of glam's generation template:
https://github.com/LukeP0WERS/fidget_math
This should make it a bit easier to create complex shapes, but since you mentioned you didn't want PRs that would add significant maintenance burden I just made it a separate crate for anyone that wants to use it. Right now however, using the vector types leads to a runtime error: STATUS_ACCESS_VIOLATION, which I assume is the result of things being dropped in the wrong way. I noticed TreeOp has a manual implementation of Drop, would I need to implement something similar for a struct like the one below?

pub struct Vec3 {
    pub x: Tree,
    pub y: Tree,
    pub z: Tree,
}

Also sorry to bother you with this, I know it's not really an issue with fidget so this probably isn't the best place to be asking for coding advice.

That's an interesting error, and I'd be curious to see a backtrace / simple reproduction case (using fidget as a crate). It sounds like a Window-flavored segfault, which shouldn't be possible to trigger from safe Rust; if there's something wrong in one of Fidget's unsafe blocks, I'd like to fix it!

(To answer your other question, the custom Drop implementation for TreeOp is to avoid blowing up the stack for deep trees, but anything containing the Tree shouldn't have to do anything special)

ok ill see if i can put something together which reproduces the error.

After attempting and failing to reproduce the error I realized the actual problem has nothing to do with my crate and the error occurred whenever I would run:

let grad = self
    .eval_grad_slice
    .eval(
        sub_tape.g_tape(&mut self.tape_storage),
            &[Grad::new(corner[0], 1.0, 0.0, 0.0)],
            &[Grad::new(corner[1], 0.0, 1.0, 0.0)],
            &[Grad::new(corner[2], 0.0, 0.0, 1.0)],
    )
    .unwrap();

For some reason the error was fixed when I offset the point to the center of voxels like this:

let grad = self
    .eval_grad_slice
    .eval(
        sub_tape.g_tape(&mut self.tape_storage),
            &[Grad::new(corner[0] + parent_size * 0.5, 1.0, 0.0, 0.0)],
            &[Grad::new(corner[1] + parent_size * 0.5, 0.0, 1.0, 0.0)],
            &[Grad::new(corner[2] + parent_size * 0.5, 0.0, 0.0, 1.0)],
    )
    .unwrap();

The error must have been introduced when I updated to the current version of fidget and forgot to offset the point. I'm pretty sure calling that function is what caused the error so that might be something to look into.

I'm assuming eval_grad_slice is a JIT evaluator? If so, that smells like a calling convention issue in Fidget's JIT, so I'll take a look.

It is a JIT evaluator yes.

Does the tape that you're evaluating use any non-inlined functions (i.e. anything implemented with calls to call_fn_unary or call_fn_binary here)?

I'm also curious whether the crate's unit tests fail on the register-chaos branch, which is an attempt to stress the calling convention.

I'm not currently using anything that implements call_fn_unary or call_fn_binary. I also tested the register-chaos branch and everything passed. I really wish I understood programming this low level so I could get to the bottom of this. Maybe I should take the time to learn assembly this summer.

Also it turns out offsetting the point actually doesn't fix anything, because I changed something unrelated to the gradient eval and the error is happening again. perhaps you're accessing the wrong register somewhere and it only leads to a crash depending on how the rest of the program is compiled or something like that?

oops hold on. I didn't test the register-chaos branch in release mode. it does fail in release mode:

failures:

---- jit::test::i_binary::modulo stdout ----
thread 'jit::test::i_binary::modulo' panicked at fidget\src\core\eval\test\interval.rs:852:21:
interval failure in 'modulo(reg, reg)': (NaN, -6.2831855) in ((NaN, NaN), (-6.2831855, -6.2831855)) => NaN not in (0, 6.2831855) (should be [NaN, NaN])
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- jit::test::i_binary::atan2 stdout ----
thread 'jit::test::i_binary::atan2' panicked at fidget\src\core\eval\test\interval.rs:852:21:
interval failure in 'atan2(reg, reg)': (NaN, -6.2831855) in ((NaN, NaN), (-6.2831855, -6.2831855)) => NaN not in (-3.1415927, 3.1415927) (should be [NaN, NaN])


failures:
    jit::test::i_binary::atan2
    jit::test::i_binary::modulo

just in case it helps i wanted to mention that I haven't had any errors since changing the code to this:

let gx = vec![Grad::new(corner[0] + parent_size * 0.5, 1.0, 0.0, 0.0)];
let gy = vec![Grad::new(corner[1] + parent_size * 0.5, 0.0, 1.0, 0.0)];
let gz = vec![Grad::new(corner[2] + parent_size * 0.5, 0.0, 0.0, 1.0)];

let grad = self
    .eval_grad_slice
    .eval(
        sub_tape.g_tape(&mut self.tape_storage),
        &gx,
        &gy,
        &gz,
     )
     .unwrap();

I guess it has something to do with the memory being stored in the heap idk.

It's hard to say; I suspect stack vs heap is a red herring and it just happened to change register allocation.

The last register bug I fixed was #121, which I diagnosed by

  1. looking at which register was invalid during the read (x0)
  2. tracing backwards through the code to see where that register was loaded from (x21)
  3. adding breakpoints before and after the JIT function and printing the x21 value, noticing that it was different (meaning we had failed to save + restore it)
  4. setting a memory watchpoint on the relevant stack address, to detect when it was being accidentally overwritten in the stack

If you want to investigate, you could start at step (3): adding int 3 to the JIT code here and here, running in a debugger (which will stop at that interrupt), printing all of the registers and seeing if any are failing to be preserved.

This inspired me to take a look at the gradient JIT again, and I noticed something suspicious – can you see if #125 fixes the issue?

So... it might be fixed? The initial weirdness of the problem comes from the fact that I could fix it by just changing unrelated things around like making the inputs vecs instead of slices or adding an extra field to a struct that has no effect on the generation code. I changed things back to taking the inputs normally with your fix and nothing broke and also slapped some random bullshit code in there to see if I could get the error to happen again and so far it's fine but I have no idea how I would thoroughly test this. I'm not a super experienced programmer (~ 2 years of experience) so programming this low level is not something I've dipped my toes in yet, but I am absolutely willing to learn. What tools do you use for register debugging like you mentioned in #111 (comment)?

It definitely makes sense that it's hard to reproduce. Looking at the stack layout here:

  • If there weren't any tape values backed up to the stack, writing to [rbp - 0x28] would overwrite xmm15
  • That means an error would only occur if the caller had a value in that particular register that was stored before + used after the JIT code executed
  • Whether that happens is down to the vagaries of LLVM

I was debugging using lldb, mostly either single-stepping through the program or disassembling entire functions and looking at what registers they used.