Calvin-Pang/MAg

About splitting the data

Opened this issue · 12 comments

How should I split the data according to your files? The code you provided uses all the patients for random splitting. Several images correspond to one patient; should I put all of them in the same folder?

If you mean the first-stage patch-level training, the answer is YES. In patch-level training, we put all MSS patches in one folder, no matter which patient they belong to, and do the same for MSI. So at the patch level we only have six folders: train/MSI, train/MSS, validation/MSI, validation/MSS, test/MSI, test/MSS.

So, taking TCGA-AZ-4615 in the MSI test set for CRC_DX as an example: there are 61 pngs in the original dataset, and I should put all of them in one folder for MSI test, right?

Yes, just use your trained patch-level models to get the predicted scores of these 61 pngs. And you can use any aggregation method (MAg, counting, averaging and so on) to get the patient's predicted result (MSI or MSS).
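As a rough illustration of the two simpler aggregation options mentioned above (averaging and counting; MAg itself is not reproduced here), the following sketch groups patch scores by patient and produces a patient-level label. The function name, the 0.5 threshold, and the assumption that each patch file name contains a patient ID of the form `TCGA-XX-XXXX` are illustrative and not taken from the repository.

```python
from collections import defaultdict
import re

# Assumption: each patch file name contains a patient ID such as "TCGA-AZ-4615".
PATIENT_ID = re.compile(r"TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}")


def aggregate_by_patient(patch_scores, threshold=0.5):
    """Turn patch-level MSI probabilities into patient-level labels.

    patch_scores: dict mapping patch file name -> predicted MSI probability.
    threshold: hypothetical cut-off used for both aggregation rules.
    """
    per_patient = defaultdict(list)
    for name, score in patch_scores.items():
        match = PATIENT_ID.search(name)
        if match is None:
            continue  # skip files without a recognizable patient ID
        per_patient[match.group(0)].append(score)

    results = {}
    for patient, scores in per_patient.items():
        # Averaging: mean patch probability compared with the threshold.
        mean_score = sum(scores) / len(scores)
        # Counting: fraction of patches individually predicted as MSI.
        msi_fraction = sum(s >= threshold for s in scores) / len(scores)
        results[patient] = {
            "n_patches": len(scores),
            "averaging": "MSI" if mean_score >= threshold else "MSS",
            "counting": "MSI" if msi_fraction >= 0.5 else "MSS",
        }
    return results
```

The input dictionary would come from running a trained patch-level model over the test patches; how those scores are produced is outside this sketch.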

Sorry to bother you. Why do I find 84 pngs for patient 'TCGA-AZ-4615' in the CRC_DX MSIMUT_test folder, not 61?

Could you share the data with me? The already split one.

Hello, I checked my data again, and in my experiment there are 84 pngs for TCGA-AZ-4615 in the MSIMUT test set. Maybe you typed the wrong ID? I found that there is a patient with 61 pngs (TCGA-AZ-4315 in the MSS test set).
I am sorry that I cannot share any image data with you, because in my experiments I put all images in one folder and use the file names listed in json or xlsx files to load them for training or testing. My patient split xlsx files are at https://github.com/Calvin-Pang/MAg/tree/main/name_patient
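Per-patient patch counts like the ones discussed here can be checked with a short script. This is a minimal sketch, assuming the patches are `.png` files whose names contain a patient ID of the form `TCGA-XX-XXXX`; the function name and the folder path in the usage note are hypothetical.

```python
from collections import Counter
from pathlib import Path
import re

# Assumption: each patch file name contains a patient ID such as "TCGA-AZ-4615".
PATIENT_ID = re.compile(r"TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}")


def count_patches_per_patient(folder):
    """Count .png patches per patient ID under `folder` (searched recursively)."""
    counts = Counter()
    for path in Path(folder).rglob("*.png"):
        match = PATIENT_ID.search(path.name)
        if match:
            counts[match.group(0)] += 1
    return counts
```

For example, `count_patches_per_patient("CRC_DX/test/MSIMUT")["TCGA-AZ-4615"]` would give the number of patches found for that patient in a local copy, which can then be compared against the counts reported in this thread.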

I am sorry, I don't know what's wrong with your split. My guess is that some data was lost when downloading or unzipping the dataset.
I just found my split dataset in my Google Drive; you can download it, and I hope it helps.
CRC_DX: https://drive.google.com/drive/folders/1sQR_4_ZjOW8IWk8cMdsF2MTMOmfS9Y06?usp=sharing
STAD: https://drive.google.com/drive/folders/1ntb9MLvBx7ptyEA3dGhpQiM1nKg-jVCH?usp=sharing

Do you have the train_mobilenet.py file? I couldn't find it.


Hello, I checked my files and found that train_mobilenet.py is just a modified train.py from an experiment with frozen parameters. The results showed that the freezing was meaningless, so please just use train.py for all models (resnet, mobilenet, ...).