KatherLab/HIA

Example on MIL/CLAM

Tato14 opened this issue · 12 comments

Hi,

Could you please share an example ExperimentFile for the MIL/CLAM pipeline?

Thanks

@Tato14 I edited the Experiment file in the repo. I hope this solves the problem.

Hi. Thanks for the reply. I hadn't noticed that you can specify mil, clam_sb, and clam_mb in the modelName parameter.

However, I am still having some issues. It seems that something goes wrong when the cross-validation splits are created: in the output below you can see that the train/val/test splits all contain 0 samples:

Namespace(B=8, adressExp='/mnt/isilon/Lung_HMAR_TCGA/train_Level1/ExperimentFile_MIL_default.txt', bag_loss='ce', bag_weight=0.7, batch_size=1, clini_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/LungDX_CLINI.xlsx'], csv_name='CLEANED_DATA', datadir_train=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/BLOCKS_NORM_MACENKO'], drop_out=True, early_stopping=False, feat_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/FEATURES'], feature_extract=False, freeze_Ratio=0.5, gpuNo=0, inst_loss='svm', k=3, log_data=True, lr=0.0001, maxBlockNum=512, max_epochs=10, model_name='mil', model_size='big', no_inst_cluster=False, normalize_targetNum=False, numHighScoreBlocks=20, numHighScorePatients=10, opt='adam', project_name='ExperimentFile_MIL_default', reg=1e-05, seed=1, slide_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/LungDX_SLIDE.csv'], subtyping=False, target_labels=['lung_type'], testing=False, train_full=False, useClassicModel=False, weighted_sample=True)
1
LOADING DATA FROM/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/BLOCKS_NORM_MACENKO...
Remove the NaN values from the Target Label...
**********************************************************************
0 Patients didnt have the proper label for target label: lung_type
**********************************************************************
Data for 0 Patients from Clini Table is not found in Slide Table!
Data for 0 Patients from Slide Table is not found in Clini Table!
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1019/1019 [00:02<00:00, 459.83it/s]
FINISHED!
TOTAL NUMBER OF PATIENTS:1019
label column: lung_type
label dictionary: {'LUAD': 0, 'LUSC': 1}
number of classes: 2
Patient-LVL; Number of samples registered in class 0: 534
Patient-LVL; Number of samples registered in class 1: 485
##############################################################


Load the DataSet...
label column: lung_type
label dictionary: {'LUAD': 0, 'LUSC': 1}
number of classes: 2
slide-level counts:  
 0    534
1    485
Name: label, dtype: int64
Patient-LVL; Number of samples registered in class 0: 534
Slide-LVL; Number of samples registered in class 0: 534
Patient-LVL; Number of samples registered in class 1: 485
Slide-LVL; Number of samples registered in class 1: 485
##############################################################

**********************************************************************
START OF CROSS VALIDATION
**********************************************************************
340

Training Fold 0!

Init train/val/test splits... ******************************************************************
Training on 0 samples
Validating on 0 samples
Testing on 0 samples
******************************************************************
Done!

Init loss function... Done!

Init Model... Done!
MIL_fc(
  (classifier): DataParallel(
    (module): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
      (2): Dropout(p=0.25, inplace=False)
      (3): Linear(in_features=512, out_features=2, bias=True)
    )
  )
)
Total number of parameters: 525826
Total number of trainable parameters: 525826

Init optimizer ... Done!

Init Loaders... Traceback (most recent call last):
  File "/home/jgibert/KatherLab/HIA/Main.py", line 40, in <module>
    ClamMILTraining(args)
  File "/home/jgibert/KatherLab/HIA/ClamMILTraining.py", line 141, in ClamMILTraining
    test_auc, val_auc, test_acc, val_acc, patient_results  = Train_MIL_CLAM(datasets, i, args)
  File "/home/jgibert/KatherLab/HIA/utils/core_utils.py", line 123, in Train_MIL_CLAM
    train_loader = Get_split_loader(train_split, training = True, testing = args.testing, weighted = args.weighted_sample)
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 278, in Get_split_loader
    weights = Make_weights_for_balanced_classes_split(split_dataset)
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 296, in Make_weights_for_balanced_classes_split
    weight_per_class = [N/len(dataset.slide_cls_ids[c]) for c in range(len(dataset.slide_cls_ids))]                                                                                                     
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 296, in <listcomp>
    weight_per_class = [N/len(dataset.slide_cls_ids[c]) for c in range(len(dataset.slide_cls_ids))]                                                                                                     
ZeroDivisionError: float division by zero
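For reference, the helper that fails boils down to something like this (my paraphrase of utils/data_utils.py; only the weight_per_class line is verbatim from the traceback):

# Paraphrase of Make_weights_for_balanced_classes_split, only to show where
# the ZeroDivisionError comes from; not the exact code from the repo.
def Make_weights_for_balanced_classes_split(dataset):
    N = float(len(dataset))
    # dataset.slide_cls_ids[c] holds the slide indices of class c; if the split
    # is empty, every entry has length 0 and the division blows up.
    weight_per_class = [N / len(dataset.slide_cls_ids[c]) for c in range(len(dataset.slide_cls_ids))]
    ...

So the division by zero just means the split itself is empty, which matches the "Training on 0 samples" above.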

Browsing the repo a bit, I found that Get_split_from_df uses a self.slide_data attribute that I am not able to find. Do you have any hints on what could be missing there? Thanks

@Tato14 This problem should be solved now. Could you please check and let me know the result?

@narminGhaffari It seems that the same error persists.

Namespace(B=8, adressExp='/mnt/isilon/Lung_HMAR_TCGA/train_Level1/ExperimentFile_MIL_default.txt', bag_loss='ce', bag_weight=0.7, batch_size=1, clini_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/LungDX_CLINI.xlsx'], csv_name='CLEANED_DATA', datadir_train=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/BLOCKS_NORM_MACENKO'], drop_out=True, early_stopping=False, feat_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/FEATURES'], feature_extract=False, freeze_Ratio=0.5, gpuNo=0, inst_loss='svm', k=3, log_data=True, lr=0.0001, maxBlockNum=512, max_epochs=10, model_name='mil', model_size='big', no_inst_cluster=False, normalize_targetNum=False, numHighScoreBlocks=20, numHighScorePatients=10, opt='adam', project_name='ExperimentFile_MIL_default', reg=1e-05, seed=1, slide_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/LungDX_SLIDE.csv'], subtyping=False, target_labels=['lung_type'], testing=False, train_full=False, useClassicModel=False, weighted_sample=True)
1
LOADING DATA FROM/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/BLOCKS_NORM_MACENKO...
Remove the NaN values from the Target Label...
**********************************************************************
0 Patients didnt have the proper label for target label: lung_type
**********************************************************************
Data for 0 Patients from Clini Table is not found in Slide Table!
Data for 0 Patients from Slide Table is not found in Clini Table!
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1019/1019 [00:02<00:00, 472.13it/s]
FINISHED!
TOTAL NUMBER OF PATIENTS:1019
label column: lung_type
label dictionary: {'LUAD': 0, 'LUSC': 1}
number of classes: 2
Patient-LVL; Number of samples registered in class 0: 534
Patient-LVL; Number of samples registered in class 1: 485
##############################################################


Load the DataSet...
label column: lung_type
label dictionary: {'LUAD': 0, 'LUSC': 1}
number of classes: 2
slide-level counts:  
 0    534
1    485
Name: label, dtype: int64
Patient-LVL; Number of samples registered in class 0: 534
Slide-LVL; Number of samples registered in class 0: 534
Patient-LVL; Number of samples registered in class 1: 485
Slide-LVL; Number of samples registered in class 1: 485
##############################################################

**********************************************************************
START OF CROSS VALIDATION
**********************************************************************
340

Training Fold 0!

Init train/val/test splits... ******************************************************************
Training on 0 samples
Validating on 0 samples
Testing on 0 samples
******************************************************************
Done!

Init loss function... Done!

Init Model... Done!
MIL_fc(
  (classifier): DataParallel(
    (module): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
      (2): Dropout(p=0.25, inplace=False)
      (3): Linear(in_features=512, out_features=2, bias=True)
    )
  )
)
Total number of parameters: 525826
Total number of trainable parameters: 525826

Init optimizer ... Done!

Init Loaders... Traceback (most recent call last):
  File "/home/jgibert/KatherLab/HIA/Main.py", line 40, in <module>
    ClamMILTraining(args)
  File "/home/jgibert/KatherLab/HIA/ClamMILTraining.py", line 137, in ClamMILTraining
    patient_results, aucList  = Train_MIL_CLAM(datasets = datasets, cur = i, args = args)
  File "/home/jgibert/KatherLab/HIA/utils/core_utils.py", line 123, in Train_MIL_CLAM
    train_loader = Get_split_loader(train_split, training = True, testing = args.testing, weighted = args.weighted_sample)
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 278, in Get_split_loader
    weights = Make_weights_for_balanced_classes_split(split_dataset)
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 296, in Make_weights_for_balanced_classes_split
    weight_per_class = [N/len(dataset.slide_cls_ids[c]) for c in range(len(dataset.slide_cls_ids))]                                                                                                     
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 296, in <listcomp>
    weight_per_class = [N/len(dataset.slide_cls_ids[c]) for c in range(len(dataset.slide_cls_ids))]                                                                                                     
ZeroDivisionError: float division by zero

@Tato14 It seems that your feature_extract flag is still False, so the feature vectors are not being extracted, and since the corresponding folder is empty, they cannot be loaded.
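If it helps, the corresponding ExperimentFile entry should look roughly like the line below; I am writing the key exactly as it shows up in your Namespace dump, but please double-check the spelling and value format against ExperimentFile_MIL_default.txt in the repo:

"feature_extract": "True",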

@narminGhaffari sorry, I didn't see this earlier. I am still getting an error:

###############################
Traceback (most recent call last):
  File "/home/jgibert/KatherLab/HIA/Main.py", line 40, in <module>
    ClamMILTraining(args)
  File "/home/jgibert/KatherLab/HIA/ClamMILTraining.py", line 53, in ClamMILTraining
    ExtractFeatures(data_dir = imgs, feat_dir = args.feat_dir, batch_size = args.batch_size, target_patch_size = -1, filterData = True,self_supervised = args.self_supervised)
AttributeError: 'Namespace' object has no attribute 'self_supervised'

I tried adding "self_supervised":"True", to the ExperimentFile, but the error persists. Moreover, I am not sure that is what it expects...
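In case it is useful as a temporary workaround, one could default the flag where it is read in ClamMILTraining.py; the call below is the one from the traceback, only the getattr fallback is my addition:

# Hypothetical stopgap: fall back to False when the ExperimentFile does not
# define self_supervised, instead of crashing on args.self_supervised.
self_supervised = getattr(args, 'self_supervised', False)
ExtractFeatures(data_dir = imgs, feat_dir = args.feat_dir, batch_size = args.batch_size,
                target_patch_size = -1, filterData = True, self_supervised = self_supervised)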

@Tato14 Please check it now!

Hi, it seems we are making progress. I am still getting an error, but I guess it is because of the filename format; could you confirm that?

Namespace(B=8, adressExp='/mnt/isilon/Lung_HMAR_TCGA/train_Level1/ExperimentFile_MIL_default.txt', bag_loss='ce', bag_weight=0.7, batch_size=1, clini_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/LungDX_CLINI.xlsx'], csv_name='CLEANED_DATA', datadir_train=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/BLOCKS_NORM_MACENKO'], drop_out=True, early_stopping=False, feat_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/FEATURES'], feature_extract=True, freeze_Ratio=0.5, gpuNo=0, inst_loss='svm', k=3, log_data=True, lr=0.0001, maxBlockNum=512, max_epochs=10, model_name='mil', model_size='big', no_inst_cluster=False, normalize_targetNum=False, numHighScoreBlocks=20, numHighScorePatients=10, opt='adam', project_name='ExperimentFile_MIL_default', reg=1e-05, seed=1, slide_dir=['/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/LungDX_SLIDE.csv'], subtyping=False, target_labels=['lung_type'], testing=False, train_full=False, useClassicModel=False, weighted_sample=True)
###############################
initializing dataset
loading model checkpoint

progress: 0/1395
TCGA-77-7139-01Z-00-DX1
processing /mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/BLOCKS_NORM_MACENKO/TCGA-77-7139-01Z-00-DX1: total of 1427 batches
Traceback (most recent call last):
  File "/home/jgibert/KatherLab/HIA/Main.py", line 40, in <module>
    ClamMILTraining(args)
  File "/home/jgibert/KatherLab/HIA/ClamMILTraining.py", line 53, in ClamMILTraining
    ExtractFeatures(data_dir = imgs, feat_dir = args.feat_dir, batch_size = args.batch_size, target_patch_size = -1, filterData = True)
  File "/home/jgibert/KatherLab/HIA/extractFeatures.py", line 115, in ExtractFeatures
    output_file_path = Compute_w_loader(file_path, output_path, 
  File "/home/jgibert/KatherLab/HIA/extractFeatures.py", line 60, in Compute_w_loader
    for count, (batch, coords) in enumerate(loader):
  File "/home/jgibert/anaconda3/envs/PyTorch_Bioformats/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/jgibert/anaconda3/envs/PyTorch_Bioformats/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/jgibert/anaconda3/envs/PyTorch_Bioformats/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jgibert/anaconda3/envs/PyTorch_Bioformats/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jgibert/KatherLab/HIA/dataGenerator/dataSet.py", line 65, in __getitem__
    coord =[int(temp.split(',')[0]) , int(temp.split(',')[1])]
ValueError: invalid literal for int() with base 10: '/mnt/isilon/Lung_HMAR_TCGA/train_Level1/LungDX/BLOCKS_NORM_MACENKO/TCGA-77-7139-01Z-00-DX1/TCGA-77-7139-01Z-00-DX1_1_12800-14848-13312-15360_.png'

Thanks!

@Tato14 Yes, it is. Our workflow creates patches with names like aaaa_(123, 455).png; the numbers inside the parentheses are the coordinates of the patch in the whole-slide image. If your files do not follow this structure, you may not need the coord variable at all and can simply comment it out.
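If you do want to keep the coordinates, here is a rough sketch of how the parsing in dataGenerator/dataSet.py could be adapted to a name like TCGA-77-7139-01Z-00-DX1_1_12800-14848-13312-15360_.png, assuming the four dash-separated numbers are the patch coordinates (this is only illustrative, not code from the repo):

import os

def parse_coords(file_name):
    # ".../TCGA-77-7139-01Z-00-DX1_1_12800-14848-13312-15360_.png" -> [12800, 14848, 13312, 15360]
    stem = os.path.splitext(os.path.basename(file_name))[0]   # drop the ".png"
    coord_block = stem.rstrip('_').split('_')[-1]             # "12800-14848-13312-15360"
    return [int(v) for v in coord_block.split('-')]

Note that the current code only expects two comma-separated values (x, y), so whatever consumes coord downstream would also need to be adjusted.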

@narminGhaffari thanks for the clarification. Since the dataloader expects a filename and coordinate pair, it was easier to edit the code for my specific filename structure. Now everything seems to be working nicely!

Just one more thing before closing:
In this line I think you should expect patientList instead of lengthList.

Thanks again for the great feedback!

@narminGhaffari I am still having the same problem:

Init train/val/test splits... ******************************************************************
Training on 0 samples
Validating on 0 samples
Testing on 0 samples
******************************************************************
Done!

Init loss function... Done!

Init Model... Done!
MIL_fc(
  (classifier): DataParallel(
    (module): Sequential(
      (0): Linear(in_features=1024, out_features=512, bias=True)
      (1): ReLU()
      (2): Dropout(p=0.25, inplace=False)
      (3): Linear(in_features=512, out_features=2, bias=True)
    )
  )
)
Total number of parameters: 525826
Total number of trainable parameters: 525826

Init optimizer ... Done!

Init Loaders... Traceback (most recent call last):
  File "/home/jgibert/KatherLab/HIA/Main.py", line 40, in <module>
    ClamMILTraining(args)
  File "/home/jgibert/KatherLab/HIA/ClamMILTraining.py", line 137, in ClamMILTraining
    patient_results, aucList  = Train_MIL_CLAM(datasets = datasets, cur = i, args = args)
  File "/home/jgibert/KatherLab/HIA/utils/core_utils.py", line 123, in Train_MIL_CLAM
    train_loader = Get_split_loader(train_split, training = True, testing = args.testing, weighted = args.weighted_sample)
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 278, in Get_split_loader
    weights = Make_weights_for_balanced_classes_split(split_dataset)
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 296, in Make_weights_for_balanced_classes_split
    weight_per_class = [N/len(dataset.slide_cls_ids[c]) for c in range(len(dataset.slide_cls_ids))]                                                                                                     
  File "/home/jgibert/KatherLab/HIA/utils/data_utils.py", line 296, in <listcomp>
    weight_per_class = [N/len(dataset.slide_cls_ids[c]) for c in range(len(dataset.slide_cls_ids))]                                                                                                     
ZeroDivisionError: float division by zero

This happens after feature extraction. It seems that the splits are not loaded properly, but I am not quite sure why.
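For what it is worth, a quick sanity check right before the loaders are built in utils/core_utils.py should show where the samples get lost; train_split and slide_cls_ids are the names from the traceback, the val/test names are my guesses:

# Debug-only sketch: print the split sizes and the per-class slide lists that
# Make_weights_for_balanced_classes_split later divides by.
for name, split in [('train', train_split), ('val', val_split), ('test', test_split)]:
    print(name, 'split size:', len(split))
    print(name, 'slides per class:', [len(ids) for ids in split.slide_cls_ids])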

@Tato14 I am checking the repo and will write back as soon as I find the problem.