Trying to split the dataset myself

Question

aguang1201 opened this issue 7 years ago · 3 comments

Hi Bruce,
Version 0.3.0 is very good. Congratulations!
I want to share a problem I encountered.
I set the config as follows:
use_default_split=false
train_patient_count=29405
dev_patient_count=1400
split_dataset_random_state=2
I increased the number of training patients, expecting the mean AUC to improve, but it did not.
The result is mean auroc: 0.7680469487585274, which is lower than the result with the default settings.
I do not understand why. Could you tell me why the default split worked better,
and how to choose train_patient_count to improve the AUC?
Thank you for your nice work.

Answer 1 · 2018-03-06T11:18:58.000Z

@aguang1201 The use_default_split option is deprecated in 0.3.0. Please specify your own dataset split with the new dataset_csv_dir option; see sample.config.ini for details. I made this decision because many people found the old option confusing. I will update the source code to warn people who use these deprecated options. Thank you.
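
For readers who want to build their own split for dataset_csv_dir, here is a minimal sketch of a patient-level split in Python. It is not the project's code: the function name split_by_patient and its return format are hypothetical, and it only borrows the names train_patient_count, dev_patient_count and the random state from the config quoted in the question. It assumes you already have one patient ID per image row. The CSV layout that dataset_csv_dir expects is described in sample.config.ini and is not reproduced here.

```python
import random
from typing import Dict, List, Sequence


def split_by_patient(
    patient_ids: Sequence[str],
    train_patient_count: int,
    dev_patient_count: int,
    random_state: int = 0,
) -> Dict[str, List[int]]:
    """Return row indices per subset so that no patient appears in two subsets.

    Hypothetical helper for illustration; patient_ids[i] is the patient ID
    of image row i.
    """
    # Sort first so the shuffle is reproducible for a given random_state.
    unique_patients = sorted(set(patient_ids))
    if train_patient_count + dev_patient_count > len(unique_patients):
        raise ValueError("requested more patients than the dataset contains")

    rng = random.Random(random_state)
    rng.shuffle(unique_patients)

    train_patients = set(unique_patients[:train_patient_count])
    dev_end = train_patient_count + dev_patient_count
    dev_patients = set(unique_patients[train_patient_count:dev_end])

    # Every remaining patient goes to the test subset.
    split: Dict[str, List[int]] = {"train": [], "dev": [], "test": []}
    for row_index, patient in enumerate(patient_ids):
        if patient in train_patients:
            split["train"].append(row_index)
        elif patient in dev_patients:
            split["dev"].append(row_index)
        else:
            split["test"].append(row_index)
    return split
```

Patients not assigned to train or dev fall into test, so the two counts should leave enough patients for evaluation. A different split also changes which images end up in the test set, so AUROC values from different splits are not measured on the same data.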

Answer 2 · 2018-03-07T00:45:46.000Z

@brucechou1983
Thank you for your reply.
It is unfortunate that the option was removed; I think splitting the dataset is very useful.
But it does not matter, I can add it in my own code.
I just want to understand why the AUC got lower when I increased the training data.
Is the amount of training data enough, or is the dev data not enough?
This is really hard to figure out.
In addition: have you tried NASNet?
Thanks.

Answer 3 · 2018-03-07T18:26:52.000Z

@aguang1201 I couldn't understand your question. Since this issue is resolved, I will close it. You can email me if you have ad hoc questions.