eric-yyjau/pytorch-superpoint

An error in Training MagicPoint on Synthetic Shapes

Closed this issue · 6 comments

Greetings! Thank you very much for your contribution to the pytorch version of Superpoint.
I followed the steps, but I ran into a problem at the first step, training MagicPoint.

I used the command
"python train4.py train_base configs/magicpoint_shapes_pair.yaml magicpoint_synth --eval"
This produces the error
"NotImplementedError: pool objects cannot be passed between processes or pickled"
After running it, the synthetic dataset appears in "dataset/synthetic_shape_v6", but no weights file appears in "logs\magicpoint_synth\checkpoints" (training does not start).

Here is the screenshot of the error report:
[screenshot of the error traceback]

I would like to express my thanks again and hope to get your reply and answer!😀

This can be fixed by changing lines 48 and 49 in utils/loader.py as follows:
workers_train = training_params.get('workers_train', 0) # 1 16
workers_val = training_params.get('workers_val', 0) # 1 16

It looks like a pickling error that occurs when running with multiple DataLoader workers.
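
For reference, here is a minimal sketch of how those two values typically end up as the DataLoader worker counts. Only the two workers_train/workers_val lines come from the repo; the function name, batch-size parameters, and DataLoader construction are illustrative assumptions, not the exact code in utils/loader.py. With num_workers=0, batches are loaded in the main process, so the dataset object never has to be pickled.

# Sketch only: the DataLoader construction below is illustrative, not the
# exact code from utils/loader.py. num_workers=0 keeps data loading in the
# main process, so nothing has to be pickled between processes.
import torch.utils.data

def build_loaders(train_set, val_set, training_params):
    workers_train = training_params.get('workers_train', 0)  # previously 1 / 16
    workers_val = training_params.get('workers_val', 0)      # previously 1 / 16

    train_loader = torch.utils.data.DataLoader(
        train_set,
        batch_size=training_params.get('train_batch_size', 64),
        shuffle=True,
        num_workers=workers_train,  # 0 -> no worker processes, no pickling
    )
    val_loader = torch.utils.data.DataLoader(
        val_set,
        batch_size=training_params.get('val_batch_size', 64),
        shuffle=False,
        num_workers=workers_val,
    )
    return train_loader, val_loader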

@xiaolangha Thank you very much for your answer! The problem has been solved, and MagicPoint can now be trained normally 😀.

@kamen-kkr I saw it in the issues; you can take a look at #54. I am also trying to reproduce this, but I keep running into errors. Currently I am also on step 1, training with "python train4.py train_base configs/magicpoint_shapes_pair.yaml magicpoint_synth --eval". If you run into any issues, we can work through them together.

@xiaolangha In step 1, what was your training speed? I have only completed 1/4 of the total epochs after 8 hours of training, since the number of parallel workers is set to 0.

I had the same question and I think I found a way to solve it. Step 1 takes about 27 hours at a speed of 3 it/s. I found that if you set workers_train higher (I set it to 8), the speed rises to about 10 it/s and 8 CPU cores are used.
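
If you want to compare settings yourself, a rough way to measure it/s for a given loader (a hypothetical helper, not part of pytorch-superpoint) is:

# Hypothetical benchmark helper, not part of the repo: compares data-loading
# throughput for different num_workers / batch_size settings.
import time

def iters_per_sec(loader, n_batches=50):
    it = iter(loader)
    next(it)  # the first batch pays the worker start-up cost; skip it
    start = time.time()
    for _ in range(n_batches):
        next(it)
    return n_batches / (time.time() - start)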

@jehovahlbf @kamen-kkr
Due to the constant power outages here, I have not been able to complete a full training run.
But I did a test as you suggested:
(1) workers_train = 0, workers_val = 0.
In this case, if batch_size is 64, the speed is 1-2 it/s, using 1 GPU.
If batch_size is 8, the speed reaches 5-7 it/s, also using 1 GPU.

Of course, I haven't finished training; this is just what I observed at the beginning of the run.

(2) workers_train = 8. In this case, the previous error is reported again: "NotImplementedError: pool objects cannot be passed between processes or pickled"

File "D:\DeepLearning\Anaconda3\envs\pytorch-superpoint-master\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "D:\DeepLearning\Anaconda3\envs\pytorch-superpoint-master\lib\multiprocessing\pool.py", line 535, in reduce
'pool objects cannot be passed between processes or pickled'
NotImplementedError: pool objects cannot be passed between processes or pickled

(pytorch-superpoint-master) D:\DeepLearning\two\hwj_DeepLearning\pytorch-superpoint-master>Traceback (most recent call last):
File "", line 1, in
File "D:\DeepLearning\Anaconda3\envs\pytorch-superpoint-master\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\DeepLearning\Anaconda3\envs\pytorch-superpoint-master\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
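
For context: on Windows, DataLoader worker processes are started with the "spawn" method, which pickles the dataset object and sends it to each worker. The traceback above (reduction.py -> pool.py raising NotImplementedError) suggests the synthetic-shapes dataset holds a multiprocessing.Pool, which cannot be pickled; the EOFError in the child is just the follow-on failure. Setting workers_train/workers_val to 0 avoids the pickling entirely. If you want to keep workers > 0 on Windows, one possible workaround, sketched below under the assumption that the dataset really does keep a Pool attribute (class and attribute names are illustrative, not the repo's), is to create the pool lazily and exclude it from the pickled state:

# Sketch of a possible workaround (assumes, based on the traceback, that the
# dataset keeps a multiprocessing.Pool attribute; names are illustrative).
# The pool is created lazily and dropped from the pickled state, so the
# dataset can be sent to spawned DataLoader workers on Windows.
import multiprocessing

class SyntheticDatasetSketch:
    def __init__(self, num_pool_workers=4):
        self.num_pool_workers = num_pool_workers
        self._pool = None  # created on first use, never pickled

    @property
    def pool(self):
        if self._pool is None:
            self._pool = multiprocessing.Pool(self.num_pool_workers)
        return self._pool

    def __getstate__(self):
        # Called when the dataset is pickled for a spawned worker:
        # drop the unpicklable pool; it is recreated lazily if needed.
        state = self.__dict__.copy()
        state['_pool'] = None
        return state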