christianpayer/MedicalDataAugmentationTool-VerSe

Some questions

zengchan opened this issue · 36 comments

When training main_vertebrae_localization.py, it is very slow: I can only train 10,000 iterations in 24 hours. Testing also gets stuck on some samples. I use one V100 GPU.

Regarding your observation on testing being stuck, I identified that the spine_postprocessing function causes problems when the networks are far from converged. As I currently don't have time to investigate this further, I added a switch that disables postprocessing in the training scripts (set self.use_spine_postprocessing = False in the __init__ function of MainLoop). During inference the postprocessing is still enabled, as it should lead to better results.

Regarding runtime, I just committed a change in the networks that will use upsample3d_cubic instead of resize_tricubic, which should be faster when training on the GPU.
Furthermore, I suggest using the dedicated script server_dataset_loop.py for faster image preprocessing (see the README.md file for how to use it). The image preprocessing is unfortunately slow and a limiting factor, as it uses SimpleITK and is limited by the speed of the CPU. However, the dedicated script should make better use of the CPU.

When using upsample3d_cubic and the server_dataset_loop.py for image preprocessing, 100 training iterations take around 104 seconds on my Titan V.

For the whole cross validation this still takes a couple of days to train. However, the results of the networks are already quite good after ~30,000 - 50,000 iterations. So for development, you don't need to train for the full 100,000 iterations. Moreover, if you are willing to investigate this further, you could try to change the optimizer and its parameters, such as the learning rate.
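To put these numbers in perspective, the following sketch converts the quoted throughput into wall-clock time. The function name is hypothetical, and the inputs are simply the figures quoted in this thread (104 seconds per 100 iterations on the Titan V, 10,000 iterations in 24 hours in the original question).

```python
def training_hours(total_iterations, seconds_per_100_iterations):
    """Estimate wall-clock training time in hours for one run."""
    return total_iterations / 100 * seconds_per_100_iterations / 3600


# Figures quoted in the thread: 104 s per 100 iterations.
full_run_hours = training_hours(100_000, 104)   # full schedule
early_stop_hours = training_hours(30_000, 104)  # lower end of "already quite good"

# The original question: 10,000 iterations in 24 hours.
reported_seconds_per_100 = 24 * 3600 / (10_000 / 100)

print(round(full_run_hours, 1))    # 28.9
print(round(early_stop_hours, 1))  # 8.7
print(reported_seconds_per_100)    # 864.0
```

Multiplying the per-run estimate by the number of cross-validation folds gives the total, which is consistent with "a couple of days" for the whole cross validation.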

Thank you very much for taking the time to answer my questions. I'll try as you say.

Thanks for your work. I trained spine_localization, vertebrae_localization and vertebrae_segmentation using upsample3d and server_dataset_loop.py as you suggested, but all of them are still slow on a Tesla V100: 100 training iterations take around 600 seconds. The image preprocessing takes a large amount of time on the CPU, so the GPU utilization is very low. Since your training speed is much faster than mine, I may be missing some extra step, so I need your help.

Unfortunately, as you also recognized, due to the large input images, the runtime of the training scripts is limited by the data augmentation, which runs entirely on the CPU. There are a couple of ways to improve that.
First, I can tell you how we train on our workstations. I have a quite old Intel 4th generation i7 and a Geforce Titan Xp. Due to the slow CPU, I also cannot run the data loading, preprocessing, and augmentation on my CPU without it being a bottleneck. However, we have a server that has a much better CPU performance. Thus, I run the preprocessing on this server and connect the script running on my workstation to this server via Pyro4. In doing so, there is no CPU bottleneck any more and the GPU utilization is at 100% most of the time. We also tried this with a Titan V and there is also no CPU bottleneck with our approach.

To set this up the same way we do, follow these steps:

  1. Copy the framework to a server that has faster CPUs.
  2. Copy the dataset to this server.
  3. Adapt the 'base_folder' at line 17 in https://github.com/christianpayer/MedicalDataAugmentationTool-VerSe/blob/master/training/server_dataset_loop.py to the local dataset folder on the server
  4. Start server_dataset_loop.py on the server. You should see something like the following output
    start PYRO:verse_dataset@server_name_or_ip:51232
  5. In the training scripts (main_*.py) set self.use_pyro_dataset = True and adapt the line server_name = @server_name_or_ip:51232.
  6. Start the training script. If everything worked, it should now connect to the server and train much faster. The output of the server_dataset_loop.py will show you how many images are currently in the FIFO queue that contains the preprocessed and augmented images.
  7. In case of errors, check the output of both server script and training script.
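The connection in steps 4 to 6 can be sketched as follows. This is only an illustration of how a Pyro4 URI of the form printed by the server is assembled and how a client would connect to it; the helper names make_pyro_uri and connect are hypothetical and are not part of the framework. Pyro4.Proxy is the standard Pyro4 client class, and it is imported lazily so the URI helper also runs on a machine without Pyro4 installed.

```python
def make_pyro_uri(object_id, host, port):
    """Build a Pyro4 URI of the form PYRO:<object_id>@<host>:<port>."""
    return f"PYRO:{object_id}@{host}:{port}"


def connect(uri):
    """Return a Pyro4 proxy for the remote object (requires Pyro4)."""
    import Pyro4  # lazy import: only needed when actually connecting
    return Pyro4.Proxy(uri)


# Same object name and port as in the example server output above;
# replace server_name_or_ip with the address of your own server.
uri = make_pyro_uri("verse_dataset", "server_name_or_ip", 51232)
print(uri)  # PYRO:verse_dataset@server_name_or_ip:51232
```

If the client cannot connect, comparing this string with the line printed by server_dataset_loop.py is a quick first check.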

You could also run server_dataset_loop.py on the same workstation on which you do the training. This alone could already give you a performance boost, as the multithreading performance of Python is not the best and a dedicated process for data loading and augmentation will make better use of the available CPU resources.

If you do not have access to a faster server, you could also try to reduce the amount of preprocessing. For this, you need to adapt dataset.py. I don't know what works best, but I can give you some hints:

  1. Remove the deformation transformation in line 400, which performs random elastic deformations. While it can greatly improve the results, it unfortunately requires a lot of CPU time. This will probably give you the largest runtime boost.
  2. Change image interpolation from linear to nearest in line 256.
  3. In case you are training with larger datasets, you probably need to increase the cache_maxsize of CachedImageDataSource such that every loaded training image will fit into memory.

If you change something in dataset.py, make sure that the augmented images used for training still look reasonable. To do so, set self.save_debug_images = True and check the generated images. However, when doing the actual training, make sure that self.save_debug_images = False, as saving the images also takes a lot of resources.

Thanks for your detailed answer. I now run the preprocessing on a server that has 48 CPU cores and adapted the line super(VerseServerDataset, self).__init__(queue_size=48, refill_queue_factor=0.0, n_threads=48, use_multiprocessing=True) in server_dataset_loop.py. With this, the GPU utilization reaches 98% and 100 training iterations take around 134 seconds. But when I train the spine localization and vertebrae localization, I run into some problems. The training log looks like this:

train iter: 5000 loss: 693.5380 loss_reg: 0.6406 loss_sigmas: 2886.8821 mean_sigmas: 3.7167 gradient_norm: 681.6187 seconds: 137.668
train iter: 5100 loss: 900.9958 loss_reg: 0.6406 loss_sigmas: 3734.5657 mean_sigmas: 3.7119 gradient_norm: 736.6319 seconds: 134.837
train iter: 5200 loss: 755.7980 loss_reg: 0.6406 loss_sigmas: 3510.8040 mean_sigmas: 3.7069 gradient_norm: 681.3399 seconds: 134.924
train iter: 5300 loss: 911.5110 loss_reg: 0.6406 loss_sigmas: 3729.2810 mean_sigmas: 3.7018 gradient_norm: 724.5123 seconds: 135.073
train iter: 5400 loss: 778.4545 loss_reg: 0.6406 loss_sigmas: 3491.3569 mean_sigmas: 3.6967 gradient_norm: 740.0433 seconds: 135.534
train iter: 5500 loss: 909.1810 loss_reg: 0.6406 loss_sigmas: 3341.4968 mean_sigmas: 3.6917 gradient_norm: 766.0714 seconds: 135.019
train iter: 5600 loss: 927.2289 loss_reg: 0.6406 loss_sigmas: 3394.4741 mean_sigmas: 3.6871 gradient_norm: 863.3042 seconds: 135.890

The loss and loss_sigmas are too large and show no downward trend after 5000 iterations. I therefore loaded input.mha and output_heatmap.mha into ITK-SNAP and found that the orientation of the heatmap differs from that of the input. So I want to know whether I need additional preprocessing steps or whether something is wrong with my training. I only ran reorient_reference_to_rai.py before training. Finally, thanks again for your work.

Maybe you need to confirm the label carefully and revise the reorient_reference_to_rai.py code.

I think the reorient_reference_to_rai.py script is correct and the landmark files fit the reoriented images. At least on my PC it is working. @zengchan can you confirm that the training is working, or is it not working for you either?
@GoodMan0 make sure that the correct image folder is used for image loading (folder images_reoriented and not images). You should also confirm that the landmarks.csv file is correct.
I suggest doing the following to check whether the network input is correct. Set self.save_debug_images = True and add 'generate_heatmaps': True to the dataset_parameters variable in any main*.py file. If you now run the main*.py file, a folder debug_train (and debug_val in case of validation) should be created in the current working directory, and every image and heatmap target that the network sees will be written to the hard disk. Please confirm in ITK-SNAP that the input and the heatmap file match. If they do not match, then something in the preprocessing failed.

Sorry, I was wrong. The proprocess_landmarks.py script should be revised according to the label format (JSON):
##############
coords = np.array([size[0]*spacing[0]-float(landmark['Z']),float(landmark['X']),size[2]*spacing[2]-float(landmark['Y'])])
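The conversion in the line above can be wrapped in a small function to try it on a single landmark. This is a minimal sketch that uses plain Python tuples instead of NumPy; the function name and the example values are hypothetical, and it assumes the JSON annotation provides 'X', 'Y' and 'Z' entries as in the line above.

```python
def landmark_to_coords(landmark, size, spacing):
    """Map one JSON landmark to coordinates, mirroring the line above.

    size and spacing are (x, y, z) sequences of the reoriented image.
    """
    return (
        size[0] * spacing[0] - float(landmark["Z"]),
        float(landmark["X"]),
        size[2] * spacing[2] - float(landmark["Y"]),
    )


# Hypothetical example values, chosen only to make the arithmetic easy to follow.
example_landmark = {"X": 10.0, "Y": 20.0, "Z": 30.0}
coords = landmark_to_coords(example_landmark, size=(100, 120, 80), spacing=(1.0, 1.0, 2.0))
print(coords)  # (70.0, 10.0, 140.0)
```

Comparing a few converted landmarks against the image in ITK-SNAP is a quick way to verify that the axes and flips are correct for your data.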

@christianpayer @GoodMan0 I use a 48-core CPU, 240 GB of CPU memory and one V100 GPU (16 GB). When I set super(VerseServerDataset, self).__init__(queue_size=24, refill_queue_factor=0.5, n_threads=4, use_multiprocessing=True), the CPU memory usage keeps increasing and I get an error (Error 32 Broken pipe).

I think the proprocess_landmarks.py script is also fine. At least, it works well for me.

Are you trying to generate the landmarks for the VerSe2019 dataset or the VerSe2020 dataset? All the files in this repository are (currently) for VerSe2019 only and need to be adapted for VerSe2020. If you compare the data descriptions for VerSe2019 (https://verse2019.grand-challenge.org/Data/) and VerSe2020 (https://verse2020.grand-challenge.org/Data/) you will see that the landmark annotation format changed. I have already adapted the scripts and plan to upload them to this GitHub repository today. Then, they will also work for VerSe2020.

Yes, proprocess_landmarks.py can be changed to coords = np.array([size[0]*spacing[0]-float(landmark['Z']),float(landmark['X']),size[2]*spacing[2]-float(landmark['Y'])]). However, I cannot use server_dataset_loop.py and I do not know why. Can you help me with this? Thank you very much. @christianpayer

@zengchan Yes, this is the line that you need to change to generate the landmarks for VerSe2020.
Regarding your broken pipe error, it could have multiple causes. First, make sure that you have a stable network connection with at least 1000Mbps.
Does the server or the client program crash? If so, do you know, why? e.g. does one of the programs need too much memory?
Your init parameters seem a little problematic. I typically use a larger queue_size and a refill_queue_factor of 0.0. If you use refill_queue_factor = 0.5, then the same augmented image is put into the queue a second time whenever the queue is less than half full. This could lead to bad local minima during training, so you need to be cautious with this parameter. If you have memory errors, you could try to set use_multiprocessing=False, as it will then use multithreading instead of multiprocessing, which allows shared memory in the CachedImageDataSource.

Yes, if I set queue_size=48, refill_queue_factor=0.0, n_threads=8, use_multiprocessing=False, the training is much slower (about 250 s for 100 iterations).

Yes, this makes sense. The augmentation operations seem to be the bottleneck. refill_queue_factor = 0.5 will make it much faster, but it is really problematic. When the client takes an object out of the queue and the queue is less than half full, the same image is put back into the queue. This means that the augmented image is seen at least twice. Thus, you will have the best speed, but a much worse network performance, since exactly the same image will be seen again after only a few iterations. This removes the benefit of extensive data augmentation, which I think is crucial for the network to perform well.

tl;dr: set refill_queue_factor = 0.0 to get better results.
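The effect of refill_queue_factor can be illustrated with a toy simulation. This is not the framework's implementation, only a simplified single-threaded model of the rule described above: an item taken from the queue is put back if the queue is less than refill_queue_factor * queue_size full. The function name, the produce_every parameter (a stand-in for a slow augmentation process) and the step-based timing are all assumptions made for illustration.

```python
from collections import Counter, deque


def simulate_queue(queue_size, refill_queue_factor, produce_every, n_steps):
    """Toy model: count how often each augmented image reaches the trainer."""
    queue = deque()
    next_id = 0
    seen = Counter()  # image id -> number of times it was used for training
    waits = 0         # steps in which the trainer found the queue empty
    for step in range(n_steps):
        # Slow producer: one freshly augmented image every `produce_every` steps.
        if step % produce_every == 0 and len(queue) < queue_size:
            queue.append(next_id)
            next_id += 1
        if not queue:
            waits += 1
            continue
        item = queue.popleft()
        seen[item] += 1
        # Refill rule from the thread: reuse the image if the queue runs low.
        if len(queue) < refill_queue_factor * queue_size:
            queue.append(item)
    return seen, waits


seen_no_refill, waits_no_refill = simulate_queue(4, 0.0, produce_every=2, n_steps=8)
seen_refill, waits_refill = simulate_queue(4, 0.5, produce_every=2, n_steps=8)
print(max(seen_no_refill.values()), waits_no_refill)
print(max(seen_refill.values()), waits_refill)
```

In this model, a factor of 0.0 makes the trainer wait but never repeats an image, while a factor of 0.5 removes the waiting at the cost of feeding the same augmented images several times.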

As the VerSe2020 dataset contains more images, while the images themselves are larger, they do not fit into the default cache size of the CachedImageDataSource. Furthermore, some images are saved as float32 and not int16 or uint8, which additionally increases the memory consumption.
So I can give you two hints on how to deal with the memory consumption: First, set the cache_maxsize parameter of CachedImageDataSource to a larger value than the default 8192 (in dataset.py). Second, make sure that the images that you load are int16, while the labels are uint8 (I plan to commit an updated reorient_reference_to_rai.py script today that ensures this).
This should improve the speed of the data augmentation, as the images fit into the cache.
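The impact of the voxel data type on memory can be estimated with simple arithmetic. The sketch below is an illustration only: the volume shape is a hypothetical example, not a value taken from the dataset, and the helper name is made up. The bytes per voxel are the standard sizes of the three types mentioned above.

```python
BYTES_PER_VOXEL = {"uint8": 1, "int16": 2, "float32": 4}


def volume_mib(shape, dtype):
    """Uncompressed in-memory size of one image volume in MiB."""
    voxels = 1
    for extent in shape:
        voxels *= extent
    return voxels * BYTES_PER_VOXEL[dtype] / 2**20


example_shape = (512, 512, 600)  # hypothetical CT volume
print(volume_mib(example_shape, "int16"))    # 300.0
print(volume_mib(example_shape, "float32"))  # 600.0
print(volume_mib(example_shape, "uint8"))    # 150.0
```

Storing an image as float32 instead of int16 doubles its footprint, so the same cache holds only half as many images.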

There may be some problems in parts of the code in pyro_dataset.py, and there may be a memory leak when using multi-threaded processing. This confuses me and is beyond my ability.

I also occasionally observed a memory leak when using use_multiprocessing = False and connecting multiple times to the same server. I did not have time to investigate that more deeply, also because simply restarting server_dataset_loop.py between runs was sufficient for me. With use_multiprocessing = True I did not observe a memory leak. However, I only tested it on our setup, so I can't confirm that it also works without problems on other setups (operating systems, Python versions, package versions, etc.).
I can give you some hints on where you need to look, but unfortunately I cannot give you more extensive support, due to my limited time.

Sorry for taking up your precious time, I will keep trying, thank you very much.

No problem! I'm glad if people are using our methods and I can help fixing issues.

Is your calculation of the evaluation metrics id_rate and MLD consistent with the official definition of VerSe2019?

No, it is not. In this repository, we calculate the id_rate and average MLD per landmark. I think, in the official VerSe challenge paper, they calculate the id_rate and MLD per patient and then calculate the average. Furthermore, I think they also set the MLD = 500 (or maybe some other value) for a missed landmark. In this repository, we just ignore missed landmarks.

Are the values from your evaluation metric calculation close to those of the official calculation?

The values themselves should be close. Both ours and their values are averages, but with different weighting factors. However, individual landmark outliers should have a larger influence in our calculation of the standard deviation.
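The difference between the two weighting schemes can be made concrete with a small sketch. The function names and the example errors are hypothetical. The penalty of 500 for a missed landmark is the value recalled (with some uncertainty) earlier in this thread, so it is a parameter here and should be checked against the official challenge description.

```python
def mld_per_landmark(errors_by_patient):
    """Average over all detected landmarks; missed landmarks (None) are ignored."""
    errors = [e for errs in errors_by_patient.values() for e in errs if e is not None]
    return sum(errors) / len(errors)


def mld_per_patient(errors_by_patient, missed_penalty=500.0):
    """Average per patient first, then over patients; missed landmarks get a penalty."""
    patient_means = []
    for errs in errors_by_patient.values():
        filled = [missed_penalty if e is None else e for e in errs]
        patient_means.append(sum(filled) / len(filled))
    return sum(patient_means) / len(patient_means)


# Hypothetical point errors in mm; None marks a missed landmark.
complete = {"patient_a": [2.0, 4.0], "patient_b": [10.0, 10.0, 10.0, 10.0]}
with_miss = {"patient_a": [2.0, None], "patient_b": [10.0, 10.0, 10.0, 10.0]}

print(mld_per_landmark(complete), mld_per_patient(complete))
print(mld_per_landmark(with_miss), mld_per_patient(with_miss))
```

Without missed landmarks the two averages differ only through the weighting (patients with more landmarks count more in the per-landmark average). With a missed landmark, the penalty dominates the per-patient value.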

Got it, thanks

@christianpayer @zengchan I'm sorry for bothering you again. When training spine_localization, I find that the results differ considerably between cross-validation folds. In my experiments, the result on cv0 is mean: 9.85, std: 5.09, median: 9.05 and on cv1 is mean: 7.50, std: 4.35, median: 6.01, while the result on cv2 is mean: 36.39, std: 143.34, median: 5.79, whose mean is much worse than 9.85 and 7.50. For vertebrae_localization, the training loss is still very large and shows no downward trend after 20,000 iterations. All of the above results are based on the VerSe19 dataset, so I want to know whether you get the same results. It would be great if you could release the training log or the results on the VerSe'19 dataset. Thanks.

I would like to ask: do additional predicted vertebrae have a large impact on the results? When calculating the id_rate, do you consider all the vertebrae in the test set or all the predicted vertebrae?

@GoodMan0 sorry for the delayed response. Regarding your experiments for the spine localization: the numbers you see with the training script (main_spine_localization.py) are not the reported ones. As we only use the center of mass of the spine in the x and y coordinates, in our VISAPP paper we reported the point error for the x and y coordinates of the predicted landmark x_spine. In the training script, however, the point error is calculated for the x, y, and z coordinates (this seems to be a bug in the training scripts). If you investigate your results more closely, you can see that the median error is small (and similar for all experiments), while the std is extremely high for cv2. This indicates that there is at least one extreme outlier. I suppose the outlier is the image where the whole legs are visible.
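The difference between the two point errors can be sketched as follows. The function name and the coordinates are hypothetical and only show how a large offset along z inflates the x, y, z error while the x, y error stays small.

```python
import math


def point_error(predicted, groundtruth, axes=(0, 1, 2)):
    """Euclidean distance between two points, restricted to the given axes."""
    return math.sqrt(sum((predicted[a] - groundtruth[a]) ** 2 for a in axes))


# Hypothetical spine landmark: close in x and y, far off in z.
predicted = (10.0, 20.0, 300.0)
groundtruth = (13.0, 24.0, 100.0)

error_xy = point_error(predicted, groundtruth, axes=(0, 1))
error_xyz = point_error(predicted, groundtruth)
print(error_xy)   # 5.0
print(error_xyz)  # about 200.06
```

A single image with a large z offset is enough to produce a small median together with a very large standard deviation.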
Regarding your observations for the vertebrae localization: Training is quite slow in the beginning. While it looks like the loss is not going down (the sigma loss will go down, while the network loss will go up), training should still progress. I just reran the training scripts for 20,000 iterations, and the heatmap output images already start to show responses at the landmarks' locations. However, you could experiment with using fixed sigmas or other optimizers, which could lead to faster convergence (e.g. Adam with higher learning rates).

I will look for the loss progression files that I obtained from the networks as reported in our VISAPP paper and upload them to this repository.

@zengchan The id_rate tells you how many of the groundtruth vertebrae are correctly identified. If you do not predict a landmark that is annotated in the groundtruth, it won't be counted as identified and the id_rate will shrink. If there is an additional landmark prediction that is not annotated in the groundtruth, it is not considered in the id_rate, so the id_rate will not be affected. This is how we calculated it. However, I also think that in the official evaluation protocol of the VerSe2019 challenge, additional vertebrae are not evaluated and not considered as errors.

As I wrote previously, we do not calculate the average measures per image, but the total average over all images. You would need to adapt the evaluation code in the framework to get exactly the same measures as used in VerSe2019.
Additionally, in the VerSe2019 challenge report, missing landmarks are set to have a point error of 500 (I think). In our framework, we do not consider missing landmarks at all.

(This is just how I recall the evaluation protocol of VerSe2019. You should check the paper and the website of the challenge.)
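The counting rule described above can be sketched in a few lines. This is not the framework's evaluation code: the function name and data layout are hypothetical, and the 20 mm distance threshold for "correctly identified" is an assumption for illustration that this thread does not state, so it should be checked against the challenge definition. The sketch also simplifies by testing only the distance to the groundtruth landmark with the same label.

```python
import math


def id_rate(groundtruth, predictions, max_distance=20.0):
    """Fraction of groundtruth landmarks with a close prediction of the same label.

    Missing predictions lower the rate; extra predictions are ignored.
    """
    identified = 0
    total = 0
    for image_id, gt_landmarks in groundtruth.items():
        predicted = predictions.get(image_id, {})
        for label, gt_point in gt_landmarks.items():
            total += 1
            pred_point = predicted.get(label)
            if pred_point is not None and math.dist(gt_point, pred_point) <= max_distance:
                identified += 1
    return identified / total


# Hypothetical example: L3 is missed, T12 is an extra prediction.
groundtruth = {"image_1": {"L1": (0, 0, 0), "L2": (0, 0, 30), "L3": (0, 0, 60)}}
predictions = {"image_1": {"L1": (0, 0, 5), "L2": (0, 0, 31), "T12": (0, 0, -30)}}

rate = id_rate(groundtruth, predictions)
print(rate)
```

Removing the extra T12 prediction leaves the rate unchanged, while adding a correct L3 prediction raises it.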

@christianpayer
Hi, thanks again for your work. It is very helpful.
I have the same problem as @zengchan: for some cases, testing gets stuck in the "vertebrae localization phase".
As you suggested, I set [self.use_spine_postprocessing = False] in the training phase and the model could be trained successfully.
But when I loaded the trained model for inference (with postprocessing), the problem happened again.
I found that the cases that cause the process to get stuck are always the same ones, namely [verse082, verse097 and verse266].
I tried to find the reason but failed. It would be greatly appreciated if you could look into this problem.
Thank you very much!

@zhuo-cheng Sorry for the delayed response. It seems that there is some problem with the SpinePostprocessing. The code of this class is not the cleanest and could cause problems. However, I did not observe problems with my trained models. Probably, because your heatmap predictions differ from mine, you get different local maxima on the individual heatmaps, and the code in SpinePostprocessing runs into a very time-consuming (or maybe infinite) loop. Unfortunately, without your exact heatmap predictions I cannot reproduce your problems.
However, you could still investigate the problem on your computer. I suggest looking into the local_maxima_landmarks that are produced by the HeatmapTest class (line 259 in main_vertebrae_localization.py). This object will contain a list of lists of landmarks. If a nested list is too large, i.e., too many local maxima are found for a landmark, spine_postprocessing (line 260) could take a very long time. You could try to adapt the parameters of HeatmapTest such that fewer local maxima get returned (e.g., increase min_max_value and multiple_min_max_value_factor).
Currently, I don't have time to improve or fix the SpinePostprocessing class, but I hope I could help you anyway.
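A quick way to follow this suggestion is to count the candidates per landmark before postprocessing runs. The sketch below assumes only what is stated above, namely that local_maxima_landmarks is a list of lists with one inner list per landmark. The function name, the threshold and the stand-in data are hypothetical.

```python
def summarize_candidates(local_maxima_landmarks, max_candidates=5):
    """Return candidate counts per landmark and the indices exceeding the threshold."""
    counts = [len(candidates) for candidates in local_maxima_landmarks]
    suspicious = [index for index, n in enumerate(counts) if n > max_candidates]
    return counts, suspicious


# Stand-in data: plain tuples instead of the framework's landmark objects.
example_maxima = [
    [(10, 20, 30)],        # one clear maximum
    [],                    # landmark not found
    [(0, 0, 0)] * 12,      # many local maxima for one landmark
]
counts, suspicious = summarize_candidates(example_maxima)
print(counts)      # [1, 0, 12]
print(suspicious)  # [2]
```

Printing this summary for the cases that get stuck (for example verse082, verse097 and verse266 mentioned above) would show whether an unusually large number of local maxima coincides with the long runtime.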

@Gabriel33hasbeenused The framework uses tensorflow version 1. I think from 1.4 to 1.15 should work. We worked with tensorflow 1.14.
Our framework should run on both CPU and GPU versions of tensorflow (i.e., tensorflow and tensorflow-gpu). You would need to adapt the data_format of the Tensors to work on the CPU.
Regarding specific CUDA and CUDNN versions: our framework is not dependent on specific versions, but tensorflow-gpu is. If you manage to make tensorflow-gpu work in your environment, our framework will work as well. See https://www.tensorflow.org/install/gpu for how to install tensorflow-gpu. Also try some minimal working examples to make sure that tensorflow-gpu is working on your machine.
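As a rough illustration of what adapting the data_format involves, the sketch below converts a 5D tensor shape between the two common layouts. It assumes that the GPU configuration uses a channels-first layout (N, C, D, H, W) and that the CPU version needs channels-last (N, D, H, W, C). The thread does not state this explicitly, and the helper names and the example shape are hypothetical.

```python
def to_channels_last(shape):
    """(N, C, D, H, W) -> (N, D, H, W, C)"""
    n, c, *spatial = shape
    return (n, *spatial, c)


def to_channels_first(shape):
    """(N, D, H, W, C) -> (N, C, D, H, W)"""
    n, *spatial, c = shape
    return (n, c, *spatial)


# Hypothetical input: batch of 1, one image channel, 96 x 96 x 128 voxels.
gpu_shape = (1, 1, 96, 96, 128)
cpu_shape = to_channels_last(gpu_shape)
print(cpu_shape)  # (1, 96, 96, 128, 1)
```

The same permutation has to be applied consistently to the network inputs, the targets and the layer definitions.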

@christianpayer Thanks for your reply. I will try it as you said. Thanks!!

@christianpayer

Hi. Congratulations on your great ranking in the VerSe2020 challenge!
I noticed you used TensorFlow 2.2 this time.
If I use TensorFlow 1.15, which you used before, for the VerSe2020 dataset, can I reproduce (almost) the same VerSe2020 results as with the TensorFlow 2.2.0 implementation?
Is the TensorFlow version the only difference between them?
Thanks!

@zhuo-cheng Thanks for your congratulations! Yes, the new version uses TF 2. In theory, you could also adapt the code to run in TF 1.15. However, I would strongly discourage that, as some interfaces in TF 2 are much easier to use. For example, we now use mixed-precision training, which roughly halves the amount of required GPU memory and makes training faster. Although there might exist an interface in TF 1 for this (I don't know exactly), you would need to adapt our framework to use it.

@christianpayer

Got it! Thank you so much!

@zengchan @christianpayer @zhuo-cheng Hello, when I run main_vertebrae_segmentation.py, I get "ValueError: Tensor's shape (3, 3, 3, 1, 96) is not compatible with supplied shape (3, 3, 3, 2, 96)". Can you help me solve this problem? If you got the script to run, could you share your working code? Thank you.