splatfacto method in Colab broken?
Closed this issue · 10 comments
Describe the bug
Running the demo.ipynb fails to start training
To Reproduce
Steps to reproduce the behavior:
- Select an example dataset - here, desolation
- Paste the simplest command into xterm: ns-train splatfacto --data data/nerfstudio/desolation
- Gets to the "setting up CUDA, this will take a few minutes" bit; cycles for quite a while, and then throws an error
- xterm then goes nuts, and constantly prompts for input; the error message is lost
Previous attempts to use this method at least started training; now it has problems even earlier.
That was fast!! I'm impressed ...
Okay, now I'm confused ... these look to be files to allow me to run locally. But my issue is with how the notebook runs in Colab - do the files there need to be altered in some fashion?
(deleted comment because malware, unfortunately I don't have experience with Colab so not the best person to help with the actual issue)
That's very nasty!!! ... Looks like I need to be on the ball, with regard to GitHub responses - not something I was aware was happening ...
To flesh this out, tried running it with the splatfacto-big method, just in case ...
Same error:
[03:15:38] Saving config to: outputs/desolation/splatfacto/2024-10-01_031537/config.yml experiment_config.py:136
Saving checkpoints to: outputs/desolation/splatfacto/2024-10-01_031537/nerfstudio_models trainer.py:142
Auto image downscale factor of 2 nerfstudio_dataparser.py:484
load_3D_points is true, but the dataset was processed with an outdated ns-process-data that didn't convert colmap points to .ply! Update the colmap
dataset automatically? [y/n]: y
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /root/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|██████████| 233M/233M [00:00<00:00, 267MB/s]
╭─────────────── viser ───────────────╮
│             ╷                       │
│   HTTP      │ http://0.0.0.0:7007   │
│   Websocket │ ws://0.0.0.0:7007     │
│             ╵                       │
╰─────────────────────────────────────╯
[03:16:08] Caching / undistorting eval images full_images_datamanager.py:230
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
[03:16:12] Caching / undistorting train images full_images_datamanager.py:230
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 559.3889
VanillaPipeline.get_train_loss_dict: 559.3837
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/gsplat/cuda/_backend.py", line 83, in
from gsplat import csrc as _C
ImportError: cannot import name 'csrc' from 'gsplat' (/usr/local/lib/python3.10/dist-packages/gsplat/__init__.py)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '10']' returned non-zero exit status 1.
It seems like gsplat is not installing/building correctly on Colab? Related: nerfstudio-project/gsplat#315
It's also possible that recent changes to gsplat will help. It's pinned to 1.3.0 in nerfstudio, but since nerfstudio-project/gsplat#365 was merged there are now pre-built wheels:
- https://github.com/nerfstudio-project/gsplat?tab=readme-ov-file#installation
- https://docs.gsplat.studio/whl/gsplat/
cc @liruilong940607 but I think he's very busy these days + also doesn't use Colab.
Thanks for looping me in @brentyi !
I did a quick test on Colab (T4 GPU) and I was able to install the latest gsplat on it. So it might just be an issue in the previous version (though I can't think of what might cause this).
The colab: https://colab.research.google.com/drive/10HVUf6e8_pRrMj4cmQ5Xepoq6BdkJkav?usp=sharing
Thanks for the input ... some progress made ...
Added cell to demo.ipynb, following the "Install Nerfstudio and Dependencies" cell:
!pip install gsplat==1.4.0 --index-url https://docs.gsplat.studio/whl
which appeared to work; uninstalled 1.3.0, installed 1.4.0.
But, this time a different error:
...
Trainer.train_iteration: 501.1180
VanillaPipeline.get_train_loss_dict: 501.1118
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/gsplat/cuda/_backend.py", line 83, in
from gsplat import csrc as _C
ImportError: /usr/local/lib/python3.10/dist-packages/gsplat/csrc.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorESt8optionalIN3c1010ScalarTypeEES5_INS6_6LayoutEES5_INS6_6DeviceEES5_IbES5_INS6_12MemoryFormatEE
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '10']' returned non-zero exit status 1.
Hey, installing gsplat's prebuilt wheels works fine for me, see:
https://colab.research.google.com/drive/10HVUf6e8_pRrMj4cmQ5Xepoq6BdkJkav?usp=sharing
You need to figure out the torch and CUDA version in the system and choose the correct prebuilt wheel for gsplat.
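For anyone landing here later, a rough sketch of that lookup follows. It assumes the wheel index uses per-build subdirectories named like `pt20cu118` (torch 2.0, CUDA 11.8), as in the gsplat README example; the helper names `wheel_tag` and `pip_command` are made up for illustration, and a wheel for your particular combination may not exist, so check the index before relying on the printed command.

```python
import re
from typing import Optional

# Assumption: the index is laid out as <base>/pt<torch major+minor>cu<cuda digits>,
# following the README example https://docs.gsplat.studio/whl/pt20cu118.
WHEEL_INDEX_BASE = "https://docs.gsplat.studio/whl"


def wheel_tag(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build a tag such as 'pt24cu121' (hypothetical helper).

    torch_version is what torch.__version__ reports, e.g. '2.4.1+cu121'.
    cuda_version is what torch.version.cuda reports, e.g. '12.1',
    or None when torch is a CPU-only build.
    """
    if cuda_version is None:
        raise ValueError("torch was built without CUDA; no CUDA wheel applies")
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognised torch version: {torch_version!r}")
    major, minor = match.groups()
    # '12.1' -> '121', '11.8' -> '118'
    cuda_digits = "".join(cuda_version.split(".")[:2])
    return f"pt{major}{minor}cu{cuda_digits}"


def pip_command(torch_version: str, cuda_version: Optional[str],
                gsplat_version: str = "1.4.0") -> str:
    """Return the pip command pointing at the matching index subdirectory."""
    tag = wheel_tag(torch_version, cuda_version)
    return (f"pip install gsplat=={gsplat_version} "
            f"--index-url {WHEEL_INDEX_BASE}/{tag}")


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        try:
            print(pip_command(torch.__version__, torch.version.cuda))
        except ValueError as err:
            print(f"cannot pick a wheel: {err}")
```

Run it in a notebook cell after the nerfstudio install cell, then paste the printed command into a new cell with a leading `!`. Pointing `--index-url` at the bare `/whl` root, without the `ptXXcuYYY` part, can pull a wheel built for a different torch, which would explain an `undefined symbol` error like the one above.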
Thanks!!! ... Had a misunderstanding about using "pip install ... --index-url ..." - so, next round installed the correct version, and the processing kicked off nicely ...