drprojects/superpoint_transformer

Creating my own dataset, but the error is "WARNING: You must download the raw data for the 'my_own' dataset."

zeyu659 opened this issue · 1 comment

Dear @drprojects,
I am having trouble creating my own dataset 'scaffolds' and running train.py. I have read the "create your own dataset" section and built 'scaffolds' following the format of the 'scannet' dataset, creating /src/datasets/scaffolds.py and scaffolds_config.py, /src/datamodules/scaffolds.py, configs/datamodule/semantic/scaffolds.yaml, and configs/experiment/semantic/scaffolds_11g.yaml.
I created a new folder /data/scaffolds/raw/ to hold the 'scaffolds' dataset. The 'train&val' part is saved in the /train folder (whereas the scannet dataset stores everything in a /scan folder) and the 'test' part in the /test folder. Each scene is a sceneXXXX_point_clouds_all.npz file in which the sem_label of every point has already been matched with its xyzrgb values, as follows:

{self.root = '/ssd1/gaozy/Code/superpoint_transformer/data/scaffolds'}/
        └── raw/
            ├── train/
            │   └── sceneXXXX_point_clouds_all.npz
            └── test/
                └── sceneXXXX_point_clouds_all.npz

However, when I try to run train.py, I get the error below.
My /data/scaffolds folder does contain my own data saved in the structure above, and I am not sure what went wrong that makes the run jump straight into this code:

# /superpoint_transformer/src/datasets/base.py, line 599
def download(self):
    self.download_warning()
    self.download_dataset()

By the way, the scannet dataset at /ssd1/gaozy/Code/superpoint_transformer/data/scannet is set up correctly and trains successfully.
Thank you in advance for your help and support.

**_The ERROR Output_**
[2024-06-23 13:06:57,681][__main__][INFO] - Instantiating datamodule <src.datamodules.scaffolds.ScaffoldsDataModule>
[2024-06-23 13:06:58,603][__main__][INFO] - Instantiating model <src.models.semantic.SemanticSegmentationModule>
[2024-06-23 13:06:58,981][__main__][INFO] - Instantiating callbacks...
[2024-06-23 13:06:58,982][src.utils.utils][INFO] - Instantiating callback <pytorch_lightning.callbacks.ModelCheckpoint>
[2024-06-23 13:06:58,991][src.utils.utils][INFO] - Instantiating callback <pytorch_lightning.callbacks.EarlyStopping>
[2024-06-23 13:06:58,995][src.utils.utils][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichModelSummary>
[2024-06-23 13:06:58,997][src.utils.utils][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichProgressBar>
[2024-06-23 13:06:58,998][src.utils.utils][INFO] - Instantiating callback <pytorch_lightning.callbacks.LearningRateMonitor>
[2024-06-23 13:06:59,000][src.utils.utils][INFO] - Instantiating callback <pytorch_lightning.callbacks.GradientAccumulationScheduler>
[2024-06-23 13:06:59,002][__main__][INFO] - Instantiating loggers...
[2024-06-23 13:06:59,002][src.utils.utils][INFO] - Instantiating logger <pytorch_lightning.loggers.wandb.WandbLogger>
[2024-06-23 13:06:59,166][__main__][INFO] - Instantiating trainer <pytorch_lightning.Trainer>
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[2024-06-23 13:07:07,673][__main__][INFO] - Logging hyperparameters!
/ssd1/gaozy/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/_vendored/force_pydevd.py:18: UserWarning: incompatible copy of pydevd already imported:
 /ssd1/gaozy/miniconda3/envs/spt/lib/python3.8/site-packages/pydevd_plugins/__init__.py
  /ssd1/gaozy/miniconda3/envs/spt/lib/python3.8/site-packages/pydevd_plugins/extensions/__init__.py
  /ssd1/gaozy/miniconda3/envs/spt/lib/python3.8/site-packages/pydevd_plugins/extensions/pydevd_plugin_omegaconf.py
  warnings.warn(msg + ':\n {}'.format('\n  '.join(_unvendored)))
wandb: Currently logged in as: 13309865896 (gaozeyu). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /ssd1/gaozy/Code/superpoint_transformer/logs/train/runs/2024-06-23_13-06-54/wandb/run-20240623_130711-w7sr5nc2
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run SPT-128
wandb: ⭐️ View project at https://wandb.ai/gaozeyu/spt_scaffolds
wandb: 🚀 View run at https://wandb.ai/gaozeyu/spt_scaffolds/runs/w7sr5nc2
[2024-06-23 13:07:18,591][__main__][INFO] - Starting training!
[2024-06-23 13:07:18,602][src.datasets.base][INFO] - WARNING: You must download the raw data for the Scaffolds dataset.
[2024-06-23 13:07:18,603][src.datasets.base][INFO] - Files must be organized in the following structure:
[2024-06-23 13:07:18,603][src.datasets.base][INFO] - 
    /ssd1/gaozy/Code/superpoint_transformer/data/scaffolds/
        └── raw/
            ├── train/
            │   └── sceneXXXX_point_clouds_all.npz
            └── test/
                └── sceneXXXX_point_clouds_all.npz
        
[2024-06-23 13:07:18,604][src.datasets.base][INFO] - 
[2024-06-23 13:07:18,604][src.datasets.scaffolds][ERROR] - 
Scaffolds does not support automatic download.
Please place the dataset files in the correct structure in your /ssd1/gaozy/Code/superpoint_transformer/data/scaffolds/' directory and re-run.
The dataset must be organized into the following structure:

    /ssd1/gaozy/Code/superpoint_transformer/data/scaffolds/
        └── raw/
            ├── train/
            │   └── sceneXXXX_point_clouds_all.npz
            └── test/
                └── sceneXXXX_point_clouds_all.npz
        

[2024-06-23 13:07:18,621][src.utils.utils][INFO] - Closing loggers...
[2024-06-23 13:07:18,621][src.utils.utils][INFO] - Closing wandb!
wandb:                                                                                
wandb: 🚀 View run SPT-128 at: https://wandb.ai/gaozeyu/spt_scaffolds/runs/w7sr5nc2
wandb: ⭐️ View project at: https://wandb.ai/gaozeyu/spt_scaffolds
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./logs/train/runs/2024-06-23_13-06-54/wandb/run-20240623_130711-w7sr5nc2/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.

Hi @zeyu659

This is probably linked to how you have configured your all_base_cloud_ids.

Our BaseDataset class inherits from PyG's Dataset class. As explained here, the download() method is called when the Dataset could not find all the raw_file_names in the raw/ directory.
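For context, PyG triggers download() via a simple existence check. Here is a minimal sketch of that logic (illustrative, not the actual torch_geometric source):

```python
import os

def needs_download(raw_dir: str, raw_file_names: list) -> bool:
    """Mirror of the check PyG performs before calling download():
    if any expected raw file is missing from raw_dir, the dataset
    concludes the raw data is absent and download() is triggered."""
    raw_paths = [os.path.join(raw_dir, f) for f in raw_file_names]
    return not all(os.path.exists(p) for p in raw_paths)
```

So if your raw_file_names property lists even one filename that does not match what is on disk (a different split folder, extension, or scene ID pattern), the whole dataset falls into download().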

Have a closer look at the PyG documentation and our BaseDataset code to understand how the raw_dir and processed_dir work. In particular, for your situation, I would suggest you have a closer look at how the following work:

  • all_base_cloud_ids
  • all_cloud_ids
  • cloud_ids
  • raw_file_names
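To debug, it may help to compare the file names your dataset expects with what actually sits under raw/. A minimal sketch, assuming the 'sceneXXXX_point_clouds_all.npz' naming from this issue (the helper names and filename pattern are illustrative, not the actual SPT code):

```python
import os

def expected_raw_file_names(cloud_ids_by_split: dict) -> list:
    """Build the expected raw file list from per-split cloud IDs,
    assuming the 'sceneXXXX_point_clouds_all.npz' naming above
    (an assumption for illustration, not the SPT implementation)."""
    return [
        os.path.join(split, f"{cid}_point_clouds_all.npz")
        for split, ids in cloud_ids_by_split.items()
        for cid in ids]

def missing_raw_files(raw_dir: str, raw_file_names: list) -> list:
    """Return the expected raw files that are absent from raw_dir."""
    return [f for f in raw_file_names
            if not os.path.exists(os.path.join(raw_dir, f))]
```

Printing missing_raw_files(dataset.raw_dir, dataset.raw_file_names) against your real dataset should pinpoint exactly which names derived from all_base_cloud_ids do not match the files you placed in raw/.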