Error During Data Pre-processing on Custom MLS Dataset
Closed this issue · 10 comments
Hello there @drprojects, @rjanvier, @loicland, @CharlesGaydon! It's very nice to see a well-documented, state-of-the-art architecture that is user-friendly to set up and run. Thanks for your work on the Superpoint Transformer.
We (@pyarelalchauhan, @xbais) are trying to train the architecture on a custom dataset collected in India. We have prepared the dataset as Binary PLY files similar to those in the DALES Object dataset (please see the header of one of our files attached below):
We have generated the relevant configuration files and other Python files for our dataset, taking inspiration from the similar files provided for the DALES and S3DIS datasets in your repository. The changes we have made for our dataset are in these directories:
- `/configs/datamodule`: added our custom YAML file
- `/configs/experiment`: added relevant YAML files for our dataset
- `/data/`: added `custom_data/raw/train` and `custom_data/raw/test`
- `/src/datamodules`: added relevant Python file for our dataset
- `/src/datasets/`: added relevant `custom-data.py` and `custom-data_config.py` files
We have read the posts #32 (related to RANSAC) and #36 (in which you discuss the parameters `voxel`, `knn`, `knn_r`, `pcp_regularization`, `pcp_spatial_weight`, `pcp_cutoff`), but we are still facing issues. It would be great if you could help us out here!
👉 Regarding Errors and Warnings
We are getting the following errors and warnings, which we are unable to resolve at the moment:
- Warning in Scikit-Learn regression
- NAG-related issue: `Cannot compute radius-based horizontal graph`
- `ValueError: min_samples may not be larger than number of samples: n_samples = 2` (following your advice in #32, we have already removed "elevation" from `partition_hf` and `point_hf`, but still could not get the training to start)
- `torch.cat(): expected a non-empty list of Tensors`
👉 Regarding Understanding the Configuration
Could you also explain the significance of the `pcp_regularization`, `pcp_spatial_weight`, and `pcp_cutoff` parameters in the `/configs/datamodule/custom_data.yaml` file?
We are currently using the following configuration values :
We have tried tweaking these, but cannot get beyond the processing stage for our dataset: tweaking these parameters gives one or more of the above-mentioned errors and warnings at different stages of processing. Kindly help.
PS: We have already ⭐'ed your repo 😉
Also, it seems that you have used `20` as a factor to normalize the elevation for DALES and KITTI-360, and `4` for S3DIS (as reported in your research paper). Can you please share how these were calculated, so that we can use this information to compute the factor for our own dataset?
Can we use the z-range (difference between the lowest and highest z values in our point clouds) as the normalizing factor?
Hi @xbais @pyarelalchauhan! Thanks for your interest in the project and for this clear and detailed issue. I can tell you searched through existing issues before filing this one; I appreciate it 😉
👉 Regarding Errors and Warnings
It seems to me that all these errors may be pointing to the same thing: one of your clouds is too small. Make sure you do not have dubious point clouds with only 1 or 2 points. Are you using `xy_tiling` or `pc_tiling` to tile your clouds as a preprocessing step? If so, inadequately setting these values may produce spurious tilings.
Here is how tiling works. Tiling is optional; if you do not need it, keep `xy_tiling=None` and `pc_tiling=None`. You can use either XY tiling or PC tiling, but not both. XY tiling applies a regular grid along the XY axes of the data, regardless of its orientation, shape, or density. The value of `xy_tiling` indicates the number of tiles in each direction, so if a single int is passed, each cloud will be divided into `xy_tiling**2` tiles. PC tiling recursively splits the data wrt its principal component along the XY plane. Each step splits the data in 2, wrt its geometry. The value of `pc_tiling` indicates the number of split steps used; hence `2**pc_tiling` tiles will be created.
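To make the two schemes concrete, here is a small standalone sketch of the tile counts described above (this is not SPT's actual implementation, just an illustration with numpy on a toy 2D point set):

```python
import numpy as np

def xy_tile_ids(xy, n):
    """Assign each point to one of n*n tiles on a regular XY grid."""
    lo, hi = xy.min(0), xy.max(0)
    # bin each coordinate into n intervals per axis
    bins = np.clip(((xy - lo) / (hi - lo + 1e-9) * n).astype(int), 0, n - 1)
    return bins[:, 0] * n + bins[:, 1]

def pc_tile_ids(xy, steps):
    """Recursively split along the XY principal component, `steps` times."""
    ids = np.zeros(len(xy), dtype=int)
    for _ in range(steps):
        new_ids = np.zeros_like(ids)
        for t in np.unique(ids):
            mask = ids == t
            pts = xy[mask] - xy[mask].mean(0)
            # principal direction of this tile in the XY plane
            direction = np.linalg.svd(pts, full_matrices=False)[2][0]
            proj = pts @ direction
            # split at the median projection: each step doubles the tile count
            new_ids[mask] = 2 * t + (proj > np.median(proj))
        ids = new_ids
    return ids

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, (1000, 2))
print(len(np.unique(xy_tile_ids(xy, 3))))  # xy_tiling=3 -> 3**2 = 9 tiles
print(len(np.unique(pc_tile_ids(xy, 2))))  # pc_tiling=2 -> 2**2 = 4 tiles
```

XY tiling can leave near-empty tiles on sparse or elongated clouds, which is exactly the kind of spurious 1-point cloud mentioned above; the median-based PC split keeps tiles balanced regardless of geometry.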
> custom dataset collected in India

You said you removed "elevation" from `partition_hf` and `point_hf`; are you sure you do not have any ground or floor in your dataset?
PS: I can hardly read your screenshots. Next time, please favor sharing the full traceback like so:

```
your python traceback goes here
```
👉 Regarding Understanding the Configuration
See my reply in #50 for interpreting these parameters. Let me know if this is not clear enough.
Before tweaking the partition parameters, I would recommend fixing the above-mentioned errors, which seem related to spurious 1-point point clouds.
Regarding the parameters for the ground plane search with RANSAC
> Also, it seems that you have used 20 as a factor to normalize the elevation for DALES and Kitti360, and you have used 4 for S3DIS. (reported in your research paper), can you please share how these were calculated, so that we can use this information to calculate it for our own dataset.
As mentioned in #32, `GroundElevation` will use RANSAC to approximate the ground or floor as a plane. This is often needed because different cloud acquisitions have different Z ranges, due to differences in altitude. We do not want the model to learn to reason on absolute Z values, but on elevation wrt the local ground/floor. Hence my above question: are you sure you do not need the elevation in your dataset?
That being said, if you have a look at the documentation for `GroundElevation`, you will see that `ground_threshold` is used to restrict the ground search to points within `[z_min, z_min + ground_threshold]`. This is a heuristic to accelerate the RANSAC algorithm and to avoid erroneously fitting the ground plane on large above-ground structures.
`ground_scale`, on the other hand, is used to scale the computed `elevation` (ie the Z-distance to the fitted ground/floor plane). We want our model's input features to live within similar ranges (eg `[0, 1]` or `[-1, 1]`), so we scale the `elevation` with a rough approximation of the maximum elevation in a setup. This is why we set `ground_scale` to 4 for indoor scenes and 20 for outdoor scenes. You may want to adapt this if you are dealing with, say, an urban environment with 100-meter-high skyscrapers.
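The logic above can be sketched in a few lines of numpy. This is not the actual `GroundElevation` code; for brevity it uses a least-squares plane fit where the real transform uses RANSAC, and the toy cloud and default values are made up for illustration:

```python
import numpy as np

def elevation_feature(points, ground_threshold=5.0, ground_scale=20.0):
    """Sketch of the elevation logic: fit a ground plane on low points,
    then return each point's scaled Z-distance above that plane."""
    z = points[:, 2]
    # heuristic: only points within [z_min, z_min + ground_threshold]
    # are candidates for the ground plane fit
    candidates = points[z < z.min() + ground_threshold]
    # least-squares plane z = a*x + b*y + c on the candidates
    # (the real transform uses RANSAC, which is robust to outliers)
    A = np.c_[candidates[:, :2], np.ones(len(candidates))]
    a, b, c = np.linalg.lstsq(A, candidates[:, 2], rcond=None)[0]
    # signed Z-distance to the fitted plane, scaled to roughly [0, 1]
    elevation = z - (points[:, 0] * a + points[:, 1] * b + c)
    return elevation / ground_scale

# toy cloud: a flat ground near z=0 plus an 18 m structure
rng = np.random.default_rng(0)
ground = np.c_[rng.uniform(0, 50, (500, 2)), rng.normal(0, 0.05, 500)]
tower = np.c_[rng.uniform(20, 25, (100, 2)), rng.uniform(0, 18, 100)]
cloud = np.vstack([ground, tower])
elev = elevation_feature(cloud, ground_threshold=2.0, ground_scale=20.0)
```

Here ground points get an elevation near 0 and the top of the structure lands near 18 / 20 = 0.9, which is the "similar ranges" behavior the feature is after.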
That was really helpful @drprojects!
Setting `xy_tiling=null` really solved the errors during processing, and now we are able to complete the two-step processing. But we are facing the following issues:
👉 Unable To Change Training Device (GPU)
I have a server that has 3 GPU devices (with 40 GB, 80 GB, and 80 GB of RAM). By default, processing and training both use the first GPU, but this led to a segmentation fault, so I had to change the device to `cuda:2` in `/configs/datamodule/my_data.yaml`. This prevented the segmentation fault during processing. But at the onset of training, an OOM error pops up immediately, and the device being used is device 0 (the same error also pops up if we use all 3 devices, because I think the smallest GPU goes out of memory and terminates the entire training). It appears that the distributed processing automatically takes the first `n` GPUs, where `n` is specified in the `gpu.yaml` file.
✔️ We are able to train the architecture on our dataset using all 3 GPU devices by setting `devices: 3` in `/configs/trainer/gpu.yaml`, but...
❌ ...we cannot find a way to specify specific GPU device(s) for training (for example `cuda:2` only, or `cuda:1` and `cuda:2`), because sometimes specific GPUs on the server are being used by other students in our research lab.
Selecting a single GPU
To select which GPU to use for a process, either set the `CUDA_VISIBLE_DEVICES` environment variable

```shell
CUDA_VISIBLE_DEVICES=YOUR_GPU_NUMBER
```

or do it at the beginning of your python script

```python
import torch
torch.cuda.set_device(0)
```
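For context, `CUDA_VISIBLE_DEVICES` remaps physical GPUs to logical indices: inside the process, `cuda:0` is the first GPU listed in the variable, so frameworks that grab "the first n devices" automatically pick the ones you exposed. A small sketch of that remapping (a hypothetical helper, not part of SPT or PyTorch, mimicking my understanding of the CUDA runtime's parsing):

```python
def logical_to_physical(cuda_visible_devices, n_physical):
    """Map logical device indices to physical GPUs the way the CUDA
    runtime does: the env var lists physical ids, in logical order."""
    if cuda_visible_devices is None:
        return list(range(n_physical))  # unset: all GPUs, in order
    physical = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        # an invalid or out-of-range entry hides it and all later entries
        if not token.isdigit() or int(token) >= n_physical:
            break
        physical.append(int(token))
    return physical

# With CUDA_VISIBLE_DEVICES=2, `cuda:0` inside the process is physical GPU 2
print(logical_to_physical("2", 3))    # -> [2]
# With CUDA_VISIBLE_DEVICES=1,2, a 2-device run uses physical GPUs 1 and 2
print(logical_to_physical("1,2", 3))  # -> [1, 2]
```

So exporting `CUDA_VISIBLE_DEVICES=1,2` before launching with `devices: 2` should keep the 40 GB GPU out of the run entirely.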
Multi-GPU
SPT has only been tested on a single GPU. We do not guarantee multi-GPU preprocessing nor training. Besides, a 40 G GPU (eg NVIDIA V100) is plenty enough for preprocessing and training all datasets in our paper. So if you run into CUDA OOM errors, you might wanna check our related tips & tricks.
If you encounter issues with multi-GPU preprocessing or training and start investigating those, we would gladly welcome a PR 😉
Great! Exporting `CUDA_VISIBLE_DEVICES` worked. Thanks!
We will surely do a PR if we work on multi-GPU; we are hopeful that we will build upon this architecture, and we will share with you accordingly. We are currently having some issues in testing; once we sort them out we will let you know so that this issue can be closed. 😉
We are both thankful to you for the prompt help.
Issue Regarding Validation Loss
We were just analyzing the results for our last training with 3 GPUs (Multi-GPU) on our dataset.
✔️ The train loss looks good and is converging...
❌ ...but we found that the validation loss is very high and not decreasing. This appears to be due to over-fitting...
Here is our graph for validation mIoU:
Could you please suggest a way to reduce the validation loss? 😬
Could it be solved by increasing the number of superpoints (ie by increasing `pcp_spatial_weight`), as suggested in #36?
Hi, indeed your validation loss is comparatively high, but relatively stable. Beyond a decreasing validation loss, what you truly want is your validation mIoU to increase. This seems to be the case, though.
Whether the final validation performance of 36.7 mIoU is "good" will depend on your specific dataset. This is not something I can do for you; you will need to investigate it yourself. How? You should start by doing a lot of visualizations:
- Are your partitions looking good? Are they bleeding between objects?
- What do your superpoint edges look like? Are neighboring superpoints well connected?
- What do your predictions look like? Is there a class your model is really bad at?
- ...
You can find some visualization tools provided in `notebooks/` to help you get started. For more advanced visualization options, see the `show()` function.
The rest is up to you, good luck ! 💪
Thanks a lot for helping us out!!
Hello, I have the same problem. Please, where did you set `xy_tiling=null`, ie in which file?
@drprojects Thank you for your efforts. Please, could you tell me the file where I should put `xy_tiling=null` to disable tiling, just to speed up the work?