Training data format
Hi,
Thank you for this interesting and insightful work. I would like to follow your experimental setting. I have the following questions about the data format.
(1) How many training protein-ligand pairs do you have after filtering out poses with RMSD greater than 2A? Is it 486740, the number of lines in it2_tt_0_lowrmsd_mols_train0_fixed.types?
(2) Could you explain the meaning of each line in it2_tt_0_lowrmsd_mols_train0_fixed.types? For example, what do the first three numbers mean in the following line?
1 5.119186 1.97462 1433B_HUMAN_1_240_pep_0/4gnt_A_rec.pdb 1433B_HUMAN_1_240_pep_0/4gnt_A_rec_5f74_amp_lig_tt_min_0.sdf.gz #-6.28497
Thank you in advance.
(1) The provided types files contain filtered poses that all have RMSD < 2A, so the number of lines in the train0 file is the number of protein-ligand pairs in the training set.
(2) Sure. The columns are:
- binary label indicating if the pose is less than 2A RMSD from crystal pose
- binding affinity (0.0 if not available)
- RMSD from crystal pose
- path to receptor file, relative to data root
- path to ligand file, relative to data root
- vina energy (after the hash symbol)
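For anyone who wants to read these files programmatically, here is a minimal sketch of a parser for one line, based only on the column description above. The names `TypesRecord` and `parse_types_line` are hypothetical and not part of the repository's code, and the sketch assumes that paths contain no whitespace or `#` characters.

```python
from typing import NamedTuple, Optional


class TypesRecord(NamedTuple):
    """One line of a .types file (hypothetical container, for illustration)."""
    label: int                    # 1 if the pose is < 2A RMSD from the crystal pose
    affinity: float               # binding affinity, 0.0 if not available
    rmsd: float                   # RMSD from the crystal pose
    receptor: str                 # receptor path, relative to the data root
    ligand: str                   # ligand path, relative to the data root
    vina_energy: Optional[float]  # value after '#', or None if absent


def parse_types_line(line: str) -> TypesRecord:
    # Everything after the first '#' is treated as the vina energy.
    body, _, comment = line.partition("#")
    fields = body.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns before '#', got {len(fields)}")
    label, affinity, rmsd, receptor, ligand = fields
    vina_energy = float(comment) if comment.strip() else None
    return TypesRecord(int(label), float(affinity), float(rmsd),
                       receptor, ligand, vina_energy)


# The example line quoted in the question above.
example = ("1 5.119186 1.97462 "
           "1433B_HUMAN_1_240_pep_0/4gnt_A_rec.pdb "
           "1433B_HUMAN_1_240_pep_0/4gnt_A_rec_5f74_amp_lig_tt_min_0.sdf.gz "
           "#-6.28497")
record = parse_types_line(example)
print(record)
```

For the example line, the first three numbers are therefore the label (1), the binding affinity (5.119186), and the RMSD (1.97462).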
Hi @mattragoza, thank you for the response, and sorry for my late reply. This answers my question.