Noble-Lab/casanovo

More information about the train/val/test split

andradesalazar opened this issue · 2 comments

Hi all,

In your latest manuscript you mention that the 30 million PSMs from MassIVE-KB were randomly split so that the training, validation, and test sets are disjoint at the peptide level.
I was wondering whether you could share more information about the split so that your results can be reproduced.
A simple table with the columns peptide and split (with the values "train", "val", or "test") would be enough.

Thanks a lot in advance.

Best,
Daniela

Hi Daniela,

You can download the MassIVE-KB train, validation and test splits used for Casanovo training from here: https://noble.gs.washington.edu/~melih/mskb_casanovo_splits.zip

The zipped archive contains three MGF files, one for each split, and a Parquet file with metadata.
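
If it helps, here is a rough sketch (not taken from the Casanovo codebase) of how one could confirm that the splits are disjoint at the peptide level and build the peptide/split table asked for above. It assumes each spectrum in the MGF files carries its peptide on a `SEQ=` line; the file names in the usage note are hypothetical, so adjust them to whatever the archive actually contains.

```python
from itertools import combinations
from pathlib import Path


def peptides_in_mgf(lines):
    """Collect peptide sequences from `SEQ=` lines (assumed annotation field)."""
    peptides = set()
    for line in lines:
        line = line.strip()
        if line.upper().startswith("SEQ="):
            peptides.add(line[4:])
    return peptides


def overlapping_peptides(split_to_peptides):
    """Return {(split_a, split_b): shared peptides} for every overlapping pair."""
    overlaps = {}
    for (a, pep_a), (b, pep_b) in combinations(split_to_peptides.items(), 2):
        shared = pep_a & pep_b
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


def peptide_split_rows(split_to_peptides):
    """Flatten the mapping into sorted (peptide, split) rows."""
    return sorted(
        (peptide, split)
        for split, peptides in split_to_peptides.items()
        for peptide in peptides
    )


def load_splits(paths):
    """paths: mapping of split name -> MGF file path (names are up to the caller)."""
    split_to_peptides = {}
    for name, path in paths.items():
        with Path(path).open() as handle:
            split_to_peptides[name] = peptides_in_mgf(handle)
    return split_to_peptides
```

For example, `splits = load_splits({"train": "train.mgf", "val": "val.mgf", "test": "test.mgf"})` followed by `overlapping_peptides(splits)` should return an empty dict if the splits are disjoint. Note that this compares the `SEQ=` strings exactly, so modified and unmodified forms of the same peptide count as different entries.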

Please note that the dataset is only temporarily available at this URL; we'll find a permanent home for it soon.

Thank you :)