splatfacto method in Colab broken?

Question

Closed this issue 3 months ago · 10 comments

fasteinke commented 3 months ago

Describe the bug
Running the demo.ipynb fails to start training

To Reproduce
Steps to reproduce the behavior:

Select an example dataset (here, desolation)
Paste the simplest command into xterm: ns-train splatfacto --data data/nerfstudio/desolation
It gets as far as setting up CUDA ("this will take a few minutes"), cycles for quite a while, and then throws an error
xterm then goes nuts and constantly prompts for input; the error message is lost

Previous attempts to use this method at least started training; now it has problems even earlier.

Answer 1 · 2024-09-29T02:22:31.000Z

That was fast!! I'm impressed ...

Answer 2 · 2024-09-29T02:29:58.000Z

Okay, now I'm confused ... these look to be files to allow me to run locally. But my issue is with how the notebook runs in Colab - do the files there need to be altered in some fashion?

Answer 3 · 2024-09-29T03:20:31.000Z

(Deleted a comment because it was malware. Unfortunately I don't have experience with Colab, so I'm not the best person to help with the actual issue.)

Answer 4 · 2024-09-29T04:45:50.000Z

That's very nasty!!! ... Looks like I need to be on the ball with regard to GitHub responses - not something I was aware was happening ...

Answer 5 · 2024-10-01T04:31:37.000Z

To flesh this out, tried running it with the splatfacto-big method, just in case ...

Same error:

[03:15:38] Saving config to: outputs/desolation/splatfacto/2024-10-01_031537/config.yml experiment_config.py:136
Saving checkpoints to: outputs/desolation/splatfacto/2024-10-01_031537/nerfstudio_models trainer.py:142
Auto image downscale factor of 2 nerfstudio_dataparser.py:484
load_3D_points is true, but the dataset was processed with an outdated ns-process-data that didn't convert colmap points to .ply! Update the colmap
dataset automatically? [y/n]: y
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /root/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 233M/233M [00:00<00:00, 267MB/s]
╭─────────────── viser ───────────────╮
│ ╷ │
│ HTTP │ http://0.0.0.0:7007 │
│ Websocket │ ws://0.0.0.0:7007 │
│ ╵ │
╰─────────────────────────────────────╯
[03:16:08] Caching / undistorting eval images full_images_datamanager.py:230
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
[03:16:12] Caching / undistorting train images full_images_datamanager.py:230
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 559.3889
VanillaPipeline.get_train_loss_dict: 559.3837
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/gsplat/cuda/_backend.py", line 83, in
from gsplat import csrc as _C
ImportError: cannot import name 'csrc' from 'gsplat' (/usr/local/lib/python3.10/dist-packages/gsplat/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '10']' returned non-zero exit status 1.

Answer 6 · 2024-10-01T04:53:47.000Z

It seems like gsplat is not installing/building correctly on Colab? Related: nerfstudio-project/gsplat#315

It's also possible that recent changes to gsplat will help. It's pinned to 1.3.0 in nerfstudio, but since nerfstudio-project/gsplat#365 was merged there are now pre-built wheels:

cc @liruilong940607 but I think he's very busy these days + also doesn't use Colab.

Answer 7 · 2024-10-01T05:59:46.000Z

Thanks for looping me in @brentyi !

I did a quick test on Colab (T4 GPU) and I was able to install the latest gsplat on it. So it might just be an issue in the previous version (though I can't think of what might cause this).

The colab: https://colab.research.google.com/drive/10HVUf6e8_pRrMj4cmQ5Xepoq6BdkJkav?usp=sharing

Answer 8 · 2024-10-02T11:48:03.000Z

Thanks for the input ... some progress made ...

Added cell to demo.ipynb, following the "Install Nerfstudio and Dependencies" cell:

!pip install gsplat==1.4.0 --index-url https://docs.gsplat.studio/whl

which appeared to work; uninstalled 1.3.0, installed 1.4.0.

But, this time a different error:

...
Trainer.train_iteration: 501.1180
VanillaPipeline.get_train_loss_dict: 501.1118
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/gsplat/cuda/_backend.py", line 83, in
from gsplat import csrc as _C
ImportError: /usr/local/lib/python3.10/dist-packages/gsplat/csrc.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorESt8optionalIN3c1010ScalarTypeEES5_INS6_6LayoutEES5_INS6_6DeviceEES5_IbES5_INS6_12MemoryFormatEE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '10']' returned non-zero exit status 1.

Answer 9 · 2024-10-02T16:48:57.000Z

Hey, installing gsplat's prebuilt wheels works fine for me, see:

https://colab.research.google.com/drive/10HVUf6e8_pRrMj4cmQ5Xepoq6BdkJkav?usp=sharing

You need to figure out the torch and CUDA version in the system and choose the correct prebuilt wheel for gsplat.
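
For anyone landing here later, a rough Python sketch of that check is below. It reads the installed torch and CUDA versions and builds a pip command pointing at a version-specific subdirectory of the wheel index. The helper names (gsplat_wheel_tag, pip_command) are made up for illustration, and the "pt<torch>cu<cuda>" subdirectory naming is an assumption that is not confirmed anywhere in this thread, so check the index itself for the directories that actually exist before relying on it.

```python
import re


def gsplat_wheel_tag(torch_version: str, cuda_version: str) -> str:
    """Build a tag such as 'pt24cu121' from torch and CUDA version strings.

    ASSUMPTION: the 'pt<major><minor>cu<major><minor>' naming is a guess at
    how the wheel index is laid out; verify it against the index itself.
    """
    torch_match = re.match(r"(\d+)\.(\d+)", torch_version)
    cuda_match = re.match(r"(\d+)\.(\d+)", cuda_version)
    if torch_match is None or cuda_match is None:
        raise ValueError(
            f"cannot parse versions: torch={torch_version!r}, cuda={cuda_version!r}"
        )
    torch_part = torch_match.group(1) + torch_match.group(2)
    cuda_part = cuda_match.group(1) + cuda_match.group(2)
    return f"pt{torch_part}cu{cuda_part}"


def pip_command(
    tag: str,
    gsplat_version: str = "1.4.0",
    base_url: str = "https://docs.gsplat.studio/whl",
) -> str:
    """Return the pip command to paste into a notebook cell (prefix with '!')."""
    return f"pip install gsplat=={gsplat_version} --index-url {base_url}/{tag}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        # torch.version.cuda is None on CPU-only builds of torch.
        if torch.version.cuda is None:
            print("this torch build has no CUDA support")
        else:
            tag = gsplat_wheel_tag(torch.__version__, torch.version.cuda)
            print(pip_command(tag))
```

Run it in a cell after the "Install Nerfstudio and Dependencies" cell, so that it reports the torch build nerfstudio will actually use. An "undefined symbol" ImportError from csrc.so, like the one above, is typically what a wheel built against a different torch version produces.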

Answer 10 · 2024-10-03T07:13:44.000Z

Thanks!!! ... Had a misunderstanding about using "pip install ... --index-url ..." - so, next round installed the correct version, and the processing kicked off nicely ...