VAST-AI-Research/TriplaneGaussian

very unstable

salier opened this issue · 7 comments

salier commented

I am using CUDA 12.1 and PyTorch 2.1.2. The deployment succeeded, but I have not yet managed to generate a model. I hope to get some help.

I get the following errors:

CUDA kernel failed : no kernel image is available for execution on the device void group_points_kernel_wrapper(int, int, int, int, int, const float *, const int *, float *) at L:38 in D:\TriplaneGaussian\tgs\models\snowflake\pointnet2_ops_lib\pointnet2_ops_ext-src\src\group_points_gpu.cu
(It seems that my GPU architecture is not supported) (occurs occasionally)

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 125, in _main
prepare(preparation_data)
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\TriplaneGaussian\gradio_app.py", line 39, in <module>
model = TGS(cfg=base_cfg.system).to(device)
File "D:\TriplaneGaussian\infer.py", line 94, in __init__
self.load_weights(self.cfg.weights, self.cfg.weights_ignore_modules)
File "D:\TriplaneGaussian\infer.py", line 50, in load_weights
state_dict = load_module_weights(
File "D:\TriplaneGaussian\tgs\utils\misc.py", line 37, in load_module_weights
ckpt = torch.load(path, map_location=map_location)
File "D:\TriplaneGaussian\env\lib\site-packages\torch\serialization.py", line 993, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "D:\TriplaneGaussian\env\lib\site-packages\torch\serialization.py", line 447, in __init__
super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
(Error while loading the model weights) (occurs occasionally)
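
The `failed finding central directory` message usually means the file handed to `torch.load` is not a complete zip archive (checkpoints saved by recent PyTorch versions are zip files), for example after an interrupted download. A minimal sketch for checking the file before loading it, using only the standard library. The function name and the path in the usage comment are hypothetical, and this is a diagnostic aid rather than a confirmed cause of the error above:

```python
import os
import zipfile


def checkpoint_looks_intact(path: str) -> bool:
    """Return True if `path` exists, is non-empty, and is a readable zip archive.

    This only checks the container format; it does not validate the tensors.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    # is_zipfile looks for the end-of-central-directory record, the same
    # structure that torch.load reported as missing.
    return zipfile.is_zipfile(path)


# Hypothetical usage (the checkpoint path is an assumption):
# if not checkpoint_looks_intact("checkpoints/model.ckpt"):
#     print("Checkpoint is truncated or corrupt; download it again.")
```

If the check fails, deleting the file and downloading the weights again is the usual next step.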

Traceback (most recent call last):
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 1132, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\queues.py", line 114, in get
raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\queueing.py", line 456, in call_prediction
output = await route_utils.call_process_api(
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\route_utils.py", line 232, in call_process_api
output = await app.get_blocks().process_api(
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\blocks.py", line 1522, in process_api
result = await self.call_function(
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\blocks.py", line 1144, in call_function
prediction = await anyio.to_thread.run_sync(
File "D:\TriplaneGaussian\env\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "D:\TriplaneGaussian\env\lib\site-packages\anyio\_backends\_asyncio.py", line 2134, in run_sync_in_worker_thread
return await future
File "D:\TriplaneGaussian\env\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
result = context.run(func, *args)
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\utils.py", line 674, in wrapper
response = f(*args, **kwargs)
File "D:\TriplaneGaussian\gradio_app.py", line 111, in run
infer(image_path, cam_dist, only_3dgs=True)
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\TriplaneGaussian\gradio_app.py", line 96, in infer
for batch in dataloader:
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
data = self._next_data()
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 1328, in _next_data
idx, data = self._get_data()
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 1294, in _get_data
success, data = self._try_get_data()
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 1145, in _try_get_data
raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 33816) exited unexpectedly

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\queueing.py", line 501, in process_events
response = await self.call_prediction(awake_events, batch)
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\queueing.py", line 465, in call_prediction
raise Exception(str(error) if show_error else None) from error

These are accompanied by various crashes that produce no error message at all.

I checked the online version, which runs on an A10G. I am using a 4090 with 64 GB of memory, which should be more than enough, but the process often stops.

We haven't tested it on Windows. The error seems to come from pointnet2_ops_lib (https://github.com/erikwijmans/Pointnet2_PyTorch). We will find a Windows machine and test it later.

salier commented

Thank you very much. Almost all errors occur during the model generation phase. Several Python processes start, and then either an error occurs or there is no error and no response for a long time.
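
The tracebacks in this thread pass through `multiprocessing\spawn.py` and then re-run `gradio_app.py`. That is how DataLoader worker processes start on Windows: each worker is a fresh Python process that re-imports the main script, which would be consistent with several Python processes appearing. One commonly used workaround, not confirmed by the maintainers for this repository, is to load data in the main process by passing `num_workers=0` to the `DataLoader`. A minimal sketch; the helper name is hypothetical, and how `gradio_app.py` constructs its `DataLoader` is an assumption:

```python
import sys


def safe_num_workers(requested: int, platform: str = sys.platform) -> int:
    """Return 0 on Windows, otherwise the requested worker count.

    With num_workers=0, torch.utils.data.DataLoader loads batches in the
    calling process and starts no worker subprocesses.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    return 0 if platform.startswith("win") else requested


# Hypothetical usage (dataset and batch size are placeholders):
# from torch.utils.data import DataLoader
# dataloader = DataLoader(dataset, batch_size=1, num_workers=safe_num_workers(4))
```

This trades some data-loading throughput for not spawning worker processes, which matters little when inferring on a single image.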

I was running into a similar issue and was able to resolve it by editing setup.py before installing pointnet2_ops to include the cuda arch I am using. It appears it's not configured to include architectures from anything newer than 20xx cards.

My steps:

  • Clone repo/setup fresh env
  • Proceed with installation until before the step Install pointnet2_ops
  • Edit line 19 of \tgs\models\snowflake\pointnet2_ops_lib\setup.py by referencing the chart here
    os.environ["TORCH_CUDA_ARCH_LIST"] = "5.0;6.0;6.1;6.2;7.0;7.5;"
    changed to
    os.environ["TORCH_CUDA_ARCH_LIST"] = "5.0;6.0;6.1;6.2;7.0;7.5;8.0;8.6;8.7"
    since I am using a 3090.
  • Save and continue installation as normal
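
The value to add can be read from PyTorch with `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple; an RTX 3090 reports `(8, 6)` and an RTX 4090 reports `(8, 9)`. The edit described above can be sketched as a small helper that appends a missing architecture to an existing list. The helper name is hypothetical, and the capability is passed in explicitly so the snippet runs without a GPU:

```python
from typing import Tuple


def add_arch(arch_list: str, capability: Tuple[int, int]) -> str:
    """Return `arch_list` with the given compute capability appended if missing."""
    wanted = f"{capability[0]}.{capability[1]}"
    # Entries are separated by ';'; drop empty entries left by a trailing ';'.
    archs = [a.strip() for a in arch_list.split(";") if a.strip()]
    if wanted not in archs:
        archs.append(wanted)
    return ";".join(archs)


original = "5.0;6.0;6.1;6.2;7.0;7.5;"
print(add_arch(original, (8, 6)))  # 5.0;6.0;6.1;6.2;7.0;7.5;8.6

# With PyTorch installed and a GPU visible, the capability can be queried:
# import torch
# capability = torch.cuda.get_device_capability(0)
```

The extension has to be rebuilt after changing the list, since the architectures are compiled in at install time.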
salier commented

I ran the script with administrator privileges, but I still have this issue.
load model ckpt done.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 967, in rebuild_pipe_connection
handle = dh.detach()
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\reduction.py", line 131, in detach
return _winapi.DuplicateHandle(
PermissionError: [WinError 5] Access is denied.

Sometimes the following errors also appear:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\queueing.py", line 456, in call_prediction
output = await route_utils.call_process_api(
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\route_utils.py", line 232, in call_process_api
output = await app.get_blocks().process_api(
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\blocks.py", line 1522, in process_api
result = await self.call_function(
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\blocks.py", line 1144, in call_function
prediction = await anyio.to_thread.run_sync(
File "D:\TriplaneGaussian\env\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "D:\TriplaneGaussian\env\lib\site-packages\anyio\_backends\_asyncio.py", line 2134, in run_sync_in_worker_thread
return await future
File "D:\TriplaneGaussian\env\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
result = context.run(func, *args)
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\utils.py", line 674, in wrapper
response = f(*args, **kwargs)
File "D:\TriplaneGaussian\gradio_app.py", line 111, in run
infer(image_path, cam_dist, only_3dgs=True)
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\TriplaneGaussian\gradio_app.py", line 96, in infer
for batch in dataloader:
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
data = self._next_data()
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 1328, in _next_data
idx, data = self._get_data()
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 1294, in _get_data
success, data = self._try_get_data()
File "D:\TriplaneGaussian\env\lib\site-packages\torch\utils\data\dataloader.py", line 1145, in _try_get_data
raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 9944) exited unexpectedly

The above exception was the direct cause of the following exception: (this one does not seem to have affected operation)

Traceback (most recent call last):
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\queueing.py", line 501, in process_events
response = await self.call_prediction(awake_events, batch)
File "D:\TriplaneGaussian\env\lib\site-packages\gradio\queueing.py", line 465, in call_prediction
raise Exception(str(error) if show_error else None) from error
Exception: None

Exception ignored in: <function _ConnectionBase.__del__ at 0x000001FE5DC18790>
Traceback (most recent call last):
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 137, in __del__
self._close()
File "C:\Users\Sariel\AppData\Local\Programs\Python\Python310\lib\multiprocessing\connection.py", line 282, in _close
_CloseHandle(self._handle)
OSError: [WinError 6] The handle is invalid.

I was running into a similar issue and was able to resolve it by editing setup.py before installing pointnet2_ops to include the cuda arch I am using. It appears it's not configured to include architectures from anything newer than 20xx cards.

My steps:

  • Clone repo/setup fresh env
  • Proceed with installation until before the step Install pointnet2_ops
  • Edit line 19 of \tgs\models\snowflake\pointnet2_ops_lib\setup.py by referencing the chart here
    os.environ["TORCH_CUDA_ARCH_LIST"] = "5.0;6.0;6.1;6.2;7.0;7.5;"
    changed to
    os.environ["TORCH_CUDA_ARCH_LIST"] = "5.0;6.0;6.1;6.2;7.0;7.5;8.0;8.6;8.7"
    since I am using a 3090.
  • Save and continue installation as normal

Hi @salier, maybe try this and see if it is a solution?

salier commented

I followed these steps and there were still errors.

Dear author, I am very interested in your work.
However, some problems occurred while reproducing your code. My hardware and software setup is: 3090, CUDA 12.1, Python 3.8, PyTorch 2.1.2. The environment has been set up, but inference fails with: Segmentation fault (core dumped)

Here's what I did:
Step 1: I ran this command in the terminal: python infer.py --config config.yaml data.image_list=[test.jpg,] --image_preprocess, but it reported an error: Segmentation fault (core dumped), as shown in the figure:
[screenshot]

Step 2: I added print statements to find which line of code causes the problem. The error comes from this line: self.point_encoder = tgs.find(self.cfg.pointcloud_encoder_cls)(self.cfg.pointcloud_encoder). Digging further, the crash happens at the module = importlib.import_module(module_string, package=None) call in tgs.find.
The code is located as follows:
[screenshot]
[screenshot]
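
For a crash inside an import like this, Python's standard `faulthandler` module can print the Python-level traceback at the moment of the segmentation fault, which helps narrow down which module's native extension was being loaded. A minimal sketch; the helper name is hypothetical, and the module string actually passed to `tgs.find` is not shown in this thread, so a standard-library module stands in for it:

```python
import faulthandler
import importlib
import sys

# Dump the Python traceback if the process receives a fatal signal such as
# SIGSEGV. sys.__stderr__ is the original stderr stream, so this still works
# when sys.stderr has been redirected.
faulthandler.enable(file=sys.__stderr__)


def import_verbose(module_string: str):
    """Import a module, printing its name first so a crash can be attributed."""
    print(f"importing {module_string} ...", file=sys.stderr, flush=True)
    module = importlib.import_module(module_string)
    print(f"imported {module_string}", file=sys.stderr, flush=True)
    return module


# Stand-in module; replace with the string that tgs.find receives.
mod = import_verbose("json")
```

The same handler can be enabled without editing any code by running the script as `python -X faulthandler infer.py ...`.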

Could you give me some advice on this issue?

I would be very grateful for any help!
Thank you again for such great work!