bennyguo/instant-nsr-pl

Can't get deterministic behavior with fixed seed

anonymous-pusher opened this issue · 2 comments

Hello and thank you for your amazing work.
I am having trouble getting deterministic behavior with the neus model when using a fixed seed. Each run gives different results (sometimes a difference of 1 to 2 in test PSNR on novel views). I tried to make training deterministic by adding pl.seed_everything(config.seed, workers=True) and Trainer(..., deterministic=True) in launch.py.
I also added the lines:

    import os
    import random

    import numpy as np
    import torch

    # seed_number is the fixed seed (e.g. config.seed)
    os.environ["PYTHONHASHSEED"] = str(seed_number)
    os.environ["PL_GLOBAL_SEED"] = str(seed_number)

    # Seed every RNG in use
    random.seed(seed_number)
    np.random.seed(seed_number)
    torch.manual_seed(seed_number)
    torch.cuda.manual_seed(seed_number)
    torch.cuda.manual_seed_all(seed_number)

    # Ask PyTorch and cuDNN for deterministic kernels
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

but still no luck.

I notice that during training, the curves from different runs follow similar trends but do not have exactly the same values, and some losses, such as the mask loss, are completely different across runs of the same code.

When debugging, I found that the weight initialization is identical across runs, but after about 3 iterations the loss values start to differ slightly.
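For anyone who wants to repeat this check, here is a minimal sketch of how the first diverging iteration could be located from two per-step loss logs. The helper name `first_divergence` and the loss values are made up for illustration; they are not taken from the repository or from my actual runs.

```python
import math


def first_divergence(losses_a, losses_b, rel_tol=0.0, abs_tol=0.0):
    """Return the index of the first step where two loss logs differ, or None.

    With both tolerances at 0.0 the comparison is exact, which is what a
    bitwise-reproducible run should satisfy.
    """
    for step, (a, b) in enumerate(zip(losses_a, losses_b)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


# Hypothetical per-iteration losses from two runs with the same seed.
run_a = [0.9312, 0.8841, 0.8420, 0.8013, 0.7650]
run_b = [0.9312, 0.8841, 0.8420, 0.8017, 0.7641]

print(first_divergence(run_a, run_b))  # index of the first differing step
```

Logging the loss at full precision (not rounded) matters here, since the early differences are very small.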

Do you have any idea why this is happening?

Thank you

It seems that the non-deterministic behavior comes from the hash encoding in tinycudann. The lines I added affect only torch operations, not the hash encoding ones. I tested this by freezing the hash grid so that it is not updated, and training was then 100% reproducible.

Therefore, although the hash table parameters are initialized identically, their updates differ between runs for some reason.

Perhaps there are some non-deterministic operations in the CUDA implementation. Any idea how to make the hash grid updates reproducible?
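One mechanism that could explain this, stated as an assumption and not something I have confirmed in the tinycudann source: if gradients for the same hash table entry are accumulated in parallel (for example with atomic adds), the order of the additions can vary from run to run, and floating-point addition is not associative. The pure-Python sketch below only illustrates that order dependence; the function `accumulate` and the values are hypothetical and do not call tinycudann.

```python
def accumulate(values, order):
    """Sum values in the given order, mimicking sequential accumulation."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Three "gradient contributions" landing on the same table entry.
values = [1e17, 1.0, -1e17]

# Same contributions, two different arrival orders.
sum_a = accumulate(values, [0, 1, 2])  # (1e17 + 1.0) - 1e17
sum_b = accumulate(values, [0, 2, 1])  # (1e17 - 1e17) + 1.0

print(sum_a, sum_b)  # 0.0 1.0
```

Real gradients would differ by far less than this extreme example, but a tiny difference in one update would be enough to make the loss curves drift apart after a few iterations.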