securefederatedai/openfl

Model tf_2dunet - Plan initialisation fails expecting /raid/datasets/MICCAI_BraTS_2019_Data_Training/HGG/0

Closed this issue · 7 comments

Describe the bug
While trying the Quick Start Guide for model tf_2dunet, the plan initialisation step is failing.

Last few lines from the error message:

File "/home/azureuser/openfl/tests/openfl_e2e/my_workspace/src/tfbrats_inmemory.py", line 29, in __init__
    X_train, y_train, X_valid, y_valid = load_from_nifti(parent_dir=data_path,
  File "/home/azureuser/openfl/tests/openfl_e2e/my_workspace/src/brats_utils.py", line 94, in load_from_nifti
    subdirs = os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/raid/datasets/MICCAI_BraTS_2019_Data_Training/HGG/0'

To Reproduce
Steps to reproduce the behavior:

  1. Follow the steps in the Quick Start Guide, replacing the model torch_cnn_mnist with tf_2dunet.
  2. Create the workspace and certify it.
  3. Generate a CSR for the aggregator and sign it with the CA.
  4. Initialise the plan: fx plan initialize

At this step the error is thrown.

Expected behavior
There should be no error during plan initialisation.

Screenshots
(screenshot of the error attached)

Machine

  • Ubuntu 22.04

Additional

The README.md describes the expected dataset structure for MICCAI_BraTS_2019_Data_Training.
But how exactly do you download the dataset? Is this documented anywhere?

For practice purposes, I found a dataset at https://www.kaggle.com/datasets/aryashah2k/brain-tumor-segmentation-brats-2019, but it contains many subfolders rather than the expected 0 and 1.

fx plan initialize currently takes the first entry from data.yaml. You either need to overwrite that entry directly to point at your dataset, or you can pass the --input_shape flag if you know the expected data shape.
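For example, data.yaml in the workspace maps each collaborator to a data path; the collaborator names and the exact format below are illustrative assumptions, not taken from this issue, so check the file generated by your workspace template:

```
# plan/data.yaml (illustrative entries only)
one,/raid/datasets/MICCAI_BraTS_2019_Data_Training/HGG/0
two,/raid/datasets/MICCAI_BraTS_2019_Data_Training/HGG/1
```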

To gain access to the data, you originally needed to send an access request to the MICCAI BraTS challenge, but that Kaggle link does appear to be the proper data. If so, the README.md includes steps to shard the data.

If someone is able to run an experiment after the fix, we should also consider closing #366 and #398.

Hi @noopurintel,

I downloaded the dataset from the Kaggle link you mentioned: https://www.kaggle.com/datasets/aryashah2k/brain-tumor-segmentation-brats-2019

After that I followed the README.md.
The steps I took were:

  1. Download the dataset from https://www.kaggle.com/datasets/aryashah2k/brain-tumor-segmentation-brats-2019
  2. Unzip the dataset: unzip archive.zip -d /raid/datasets/
  3. Check the unzipped dataset with the tree command: /raid/datasets# tree $DATA_PATH -L 2
.
-- MICCAI_BraTS_2019_Data_Training
    |-- HGG
    |-- LGG
    |-- name_mapping.csv
    `-- survival_data.csv

3 directories, 2 files
  4. cd MICCAI_BraTS_2019_Data_Training/HGG/
  5. export SUBFOLDER=HGG
  6. Run this loop in the terminal for 2 collaborators; change the modulus to the number of collaborators, as mentioned in the README:
i=0
for f in *;
do
    d=$((i % 2));  # change 2 to the number of data slices (collaborators in the federation)
    mkdir -p "$d";
    mv "$f" "$d";
    i=$((i + 1));
done
  7. Check the result: /raid/datasets/MICCAI_BraTS_2019_Data_Training/HGG# tree -L 1
.
|-- 0
`-- 1

2 directories, 0 files
  8. Follow the Quick Start Guide:
          INFO     Creating Initial Weights File    🠆 save/tf_2dunet_brats_init.pbuf     plan.py:195
          INFO     FL-Plan hash is 196b877a93866735ca18687a2d1f94ad6dca8a3f0de541f84ca267ccc5fd63be00dd488102c0540c0b4efb434653b2c0     plan.py:287
          INFO     ['plan_196b877a']     plan.py:222

 ✔️ OK
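The round-robin sharding loop from step 6 can also be sketched in Python. This is a hypothetical equivalent of the shell loop, demonstrated on a throwaway temporary directory rather than the real BraTS dataset; the subject folder names are made up:

```python
import os
import shutil
import tempfile

def shard_round_robin(parent_dir: str, n_slices: int) -> None:
    """Move each entry in parent_dir into shard directories 0..n_slices-1,
    assigned round-robin, mirroring the shell loop from the README."""
    entries = sorted(os.listdir(parent_dir))
    for i, name in enumerate(entries):
        shard = os.path.join(parent_dir, str(i % n_slices))
        os.makedirs(shard, exist_ok=True)
        shutil.move(os.path.join(parent_dir, name), shard)

# Demo on a temporary directory standing in for .../HGG
with tempfile.TemporaryDirectory() as hgg:
    for subject in ("BraTS19_A", "BraTS19_B", "BraTS19_C", "BraTS19_D"):
        os.mkdir(os.path.join(hgg, subject))
    shard_round_robin(hgg, n_slices=2)
    print(sorted(os.listdir(hgg)))  # → ['0', '1']
```

Passing a different n_slices produces one shard directory per collaborator, which matches what the tree output in step 7 should show.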

For the error mentioned below, I have a fix in #1178.

File "<__array_function__ internals>", line 200, in concatenate
ValueError: need at least one array to concatenate

@noopurintel, can you confirm this and let us know, so we can close the issue accordingly?

@rahulga1 @tanwarsh @kta-intel - We tried this. Below are our observations.

  1. With 4 CPUs and 16 GB RAM, the initialization process is killed on its own after 2-3 minutes into the run.
  2. With 16 CPUs and 64 GB RAM, it took 4 hours to complete 2 rounds of training.

Could you please suggest or document the minimal configuration required to test this? Also, roughly how long should one round of training take? This would be helpful for users.

Hi @noopurintel,

I was able to complete a 10-round experiment in 4-5 hours with 16 CPUs and 64 GB RAM.

Hi @noopurintel, Closing this issue as the error is resolved.