loading null functions on MacOS
Closed this issue · 11 comments
I followed the instructions to build and downloaded the banana folder to test it out. It steps from 0 to 2000, but the final PLY model is empty. I see that when the functions are initially loaded, null is returned.
First, it successfully loads the images:
❯ ./opensplat /Users/pats/Downloads/banana -n 2000
Using MPS
Reading 14241 points
Loading /Users/pats/Downloads/banana/images/frame_00001.JPGLoading
/Users/pats/Downloads/banana/images/frame_00003.JPG
Loading /Users/pats/Downloads/banana/images/frame_00005.JPGLoading /Users/pats/Downloads/banana/images/frame_00008.JPG
Loading /Users/pats/Downloads/banana/images/frame_00015.JPG
Loading /Users/pats/Downloads/banana/images/frame_00010.JPG
Loading /Users/pats/Downloads/banana/images/frame_00013.JPG
Loading /Users/pats/Downloads/banana/images/frame_00014.JPG
Loading /Users/pats/Downloads/banana/images/frame_00004.JPG
Loading /Users/pats/Downloads/banana/images/frame_00002.JPG
Loading /Users/pats/Downloads/banana/images/frame_00006.JPG
Loading /Users/pats/Downloads/banana/images/frame_00016.JPG
Loading /Users/pats/Downloads/banana/images/frame_00009.JPG
Loading /Users/pats/Downloads/banana/images/frame_00011.JPG
It also seems to successfully load the libraries right after this, but then the load functions keep returning null:
init_gsplat_metal_context: loading '/Users/pats/Library/CloudStorage/OneDrive-Personal/Georgie/Polo/3DGS/OpenSplat/build/default.metallib'
init_gsplat_metal_context: loaded '/Users/pats/Library/CloudStorage/OneDrive-Personal/Georgie/Polo/3DGS/OpenSplat/build/default.metallib', functions: compute_cov2d_bounds_kernel, project_gaussians_backward_kernel, get_tile_bin_edges_kernel, rasterize_backward_kernel, map_gaussian_to_intersects_kernel, nd_rasterize_backward_kernel, project_gaussians_forward_kernel, compute_sh_backward_kernel, compute_sh_forward_kernel, nd_rasterize_forward_kernel
init_gsplat_metal_context: load function nd_rasterize_backward_kernel with label: (null)
init_gsplat_metal_context: load function nd_rasterize_forward_kernel with label: (null)
init_gsplat_metal_context: load function rasterize_backward_kernel with label: (null)
init_gsplat_metal_context: load function project_gaussians_forward_kernel with label: (null)
init_gsplat_metal_context: load function project_gaussians_backward_kernel with label: (null)
init_gsplat_metal_context: load function compute_sh_forward_kernel with label: (null)
init_gsplat_metal_context: load function compute_sh_backward_kernel with label: (null)
init_gsplat_metal_context: load function compute_cov2d_bounds_kernel with label: (null)
init_gsplat_metal_context: load function map_gaussian_to_intersects_kernel with label: (null)
init_gsplat_metal_context: load function get_tile_bin_edges_kernel with label: (null)
Step 10: 0.208454
.
.
.
cameras.json is not empty, but splat.ply is completely empty and doesn't render online or even in the Mac viewer. I have a MacBook Pro with an M1 Pro and 16 GB of memory, running Sonoma 14.0.
Does it work with --cpu? If so, this might be some issue with the MPS backend/rasterizer.
Hey,
I'm having a very similar issue on my MacBook Air M2 with 8 GB RAM. Everything builds and runs, but the .ply file is empty in the viewer. However, I just tried the banana with --cpu and 500 iterations, and it worked fine, although it took longer. Seems like an MPS issue.
Same issue here, with --cpu it works!
Me too. It seems the GPU support on Mac is still buggy.
Same issue on a Mac M2 with the Metal version. With the banana example, everything is OK until the 22nd iteration, when the 1457th point becomes:
[ nan, nan, nan, 0. , 0. ,0. , 0.28498277, 0.28498277, 0.28498277, ...... , -1.3643292 , nan, nan, nan, nan, nan, nan, nan].
I think there is some chance that the data loses its value and becomes NaN, and we should skip the iteration when that happens.
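The idea of skipping an iteration once values turn into NaN could be sketched roughly as follows. This is a minimal, standalone C++ illustration and not OpenSplat's actual training loop: `apply_step_if_finite`, its parameters, and the plain gradient-descent update are all hypothetical stand-ins. Note that a check like this only hides the symptom; it does not explain where the NaN comes from.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one optimizer step (not OpenSplat code).
// The update is applied only if the loss and every gradient component are
// finite. Returns true if the step was applied, false if it was skipped.
bool apply_step_if_finite(std::vector<float> &params,
                          const std::vector<float> &grads,
                          float loss, float lr) {
    if (!std::isfinite(loss)) {
        return false;  // NaN or inf loss: skip this iteration
    }
    for (float g : grads) {
        if (!std::isfinite(g)) {
            return false;  // one bad gradient would poison the parameters
        }
    }
    // Plain gradient descent, used here only to keep the sketch short.
    for (std::size_t i = 0; i < params.size() && i < grads.size(); ++i) {
        params[i] -= lr * grads[i];
    }
    return true;
}
```

A caller would invoke this once per iteration and simply continue with the next image when it returns false, leaving the parameters untouched.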
I've been trying to train on an M1 Max using the MPS GPU build options. Using the banana dataset with n=2000, the program outputs a NaN at some point. After this, the training goes downhill and produces results with heavy artifacts.
Sometimes it encounters a NaN and crashes immediately with the following message:
Step 390: 0.109648
Step 400: 0.124691
Step 410: nan
element 0 of tensors does not require grad and does not have a grad_fn
Exception raised from run_backward at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/autograd.cpp:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 52 (0x100a8ecbc in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 92 (0x100a8b8dc in libc10.dylib)
frame #2: torch::autograd::run_backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, bool, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, bool, bool) + 1228 (0x10f945290 in libtorch_cpu.dylib)
frame #3: torch::autograd::backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::optional<bool>, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&) + 96 (0x10f944628 in libtorch_cpu.dylib)
frame #4: torch::autograd::VariableHooks::_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, std::__1::optional<at::Tensor> const&, std::__1::optional<bool>, bool) const + 384 (0x10f995674 in libtorch_cpu.dylib)
frame #5: at::Tensor::backward(at::Tensor const&, std::__1::optional<bool>, bool, std::__1::optional<c10::ArrayRef<at::Tensor>>) const + 248 (0x1007b6194 in opensplat)
frame #6: main + 16752 (0x1007b25b0 in opensplat)
frame #7: start + 2840 (0x1996f0274 in dyld)
The problem has been narrowed down to gsplat_metal.metal, where the calculation sometimes produces NaN.
I am not familiar with Metal programming, but "1.f / (1.f - alpha)" is the main suspect. After I added "alpha < 0.99f" at line 962, the banana example produces a correct PLY file (n=1000).
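To illustrate why that expression is a plausible culprit, here is a small CPU-side C++ sketch. It is not the code from gsplat_metal.metal: both function names are hypothetical, and how the real kernel uses the "alpha < 0.99f" condition is an assumption; only the expression and the threshold come from the comment above. On IEEE 754 hardware, alpha equal to 1 makes the denominator zero and the result infinite, and an infinity multiplied by zero later on becomes NaN.

```cpp
#include <cmath>

// Hypothetical helper mirroring the quoted expression without any guard.
// With IEEE 754 floats this returns +inf when alpha == 1.f.
float reciprocal_unguarded(float alpha) {
    return 1.f / (1.f - alpha);
}

// Hypothetical guarded variant: divide only when alpha is below the 0.99f
// threshold mentioned above. Returns false (and leaves `out` untouched)
// otherwise. The negated comparison also rejects a NaN alpha.
bool reciprocal_guarded(float alpha, float &out) {
    if (!(alpha < 0.99f)) {
        return false;
    }
    out = 1.f / (1.f - alpha);  // denominator is now at least about 0.01
    return true;
}
```

With the guard in place the reciprocal is bounded by roughly 100, so it can no longer overflow to infinity for alpha values close to 1.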
That's awesome @zctu ! Thanks for sharing your findings.
Would you be interested in opening a PR to fix this? 🙏
You did, thanks!