luwei0917/TankBind

How to match the atoms of the predicted ligand and the ground truth

Closed this issue · 7 comments

Thanks for your amazing work!

I ran the example prediction code and noticed that the atom order/permutation of the predicted ligand does not match the ground truth. I wonder how to calculate the correct RMSD in this situation.

Thanks!

Thanks.
I reorder the atoms of both ligands so that each follows the atom order of its corresponding SMILES:
from rdkit import Chem
sm = Chem.MolToSmiles(mol)  # writing the SMILES sets the _smilesAtomOutputOrder property
m_order = list(mol.GetPropsAsDict(includePrivate=True, includeComputed=True)['_smilesAtomOutputOrder'])
mol = Chem.RenumberAtoms(mol, m_order)  # new atom i is the old atom m_order[i]

You could also use other packages such as DockRMSD.
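
Once both ligands share an atom order, the RMSD is a plain per-atom comparison. The sketch below is a minimal pure-Python illustration and not code from the TankBind repository: `reorder` and `rmsd` are hypothetical helpers, the coordinates are toy values, and the two order lists stand in for each molecule's `_smilesAtomOutputOrder`. No superposition is applied, since docking poses are usually compared in the protein frame.

```python
import math

def reorder(coords, order):
    """Rearrange coords so that new atom i is the old atom order[i]."""
    return [coords[j] for j in order]

def rmsd(coords_a, coords_b):
    """In-place RMSD (no alignment) between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("atom counts differ")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords_a))

# Toy three-atom ligand: identical pose, but the atoms are listed differently.
truth = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
pred = [(1.5, 1.5, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]

# Hypothetical stand-ins for each molecule's _smilesAtomOutputOrder.
truth_order = [0, 1, 2]
pred_order = [1, 2, 0]

naive = rmsd(pred, truth)  # atoms compared in file order: misleadingly large
matched = rmsd(reorder(pred, pred_order), reorder(truth, truth_order))  # 0.0
```

Note that a canonical order alone does not resolve symmetric atoms (for example the two sides of a phenyl ring), so the value can still be overestimated for symmetric ligands; symmetry-aware tools such as DockRMSD address that case.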

Thanks for answering, this makes a lot of sense.

I have some detailed questions while reading the code.

  1. In Eq.(4) and the description of "enclosing" in Appendix J, what exactly does "covers more than 90% of the native interaction" mean?
  2. Stage 2 is an additional loss-minimization process that predicts the coordinates of the compound. Does this stage occur during training? If so, how does Eq.(5) update the model weights when D^pred_ij is fixed?
  3. What is the meaning of "native_num_contact" (or data.is_equivalent_native_pocket/data.equivalent_native_y_mask) in TankBindDataset.get()?

Thanks!

Questions 1 and 3 are related. A contact exists if the distance between a protein node and a compound node is less than 8 Å.
I count the number of such pairs in the native complex and call this number "native num contact" (only used during training). I also count "num contact" in the same way, based on the predicted interaction distance map.
"Covers more than 90% of the native interaction" means that the ratio of num contact to native num contact is above 90%.
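
As a rough illustration of the counting described above, here is a minimal pure-Python sketch. It is not the code in TankBindDataset.get(): the function names, the nested-list distance maps (rows are protein nodes, columns are compound nodes), and the toy values are all assumptions made for the example. Only the 8 Å cutoff and the 90% threshold come from the explanation above.

```python
CONTACT_CUTOFF = 8.0  # angstroms, the contact definition given above

def count_contacts(distance_map, cutoff=CONTACT_CUTOFF):
    """Count (protein node, compound node) pairs closer than the cutoff."""
    return sum(1 for row in distance_map for d in row if d < cutoff)

def covers_native(distance_map, native_distance_map, threshold=0.9):
    """True if num contact / native num contact is above the threshold."""
    native_num_contact = count_contacts(native_distance_map)
    if native_num_contact == 0:
        return False  # nothing to cover
    return count_contacts(distance_map) / native_num_contact > threshold

# Toy maps: 4 protein nodes x 3 compound nodes, distances in angstroms.
native_map = [
    [3.0, 7.5, 9.0],
    [6.0, 12.0, 7.9],
    [15.0, 4.2, 8.0],   # exactly 8.0 is not a contact (strictly less than)
    [20.0, 25.0, 30.0],
]
predicted_map = [
    [3.5, 7.0, 9.5],
    [6.5, 11.0, 8.4],   # this pair drifted past the cutoff
    [14.0, 4.0, 8.2],
    [19.0, 24.0, 29.0],
]

native_num_contact = count_contacts(native_map)        # 5
num_contact = count_contacts(predicted_map)            # 4
is_covered = covers_native(predicted_map, native_map)  # 4 / 5 = 0.8, so False
```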

Question 2:
No, this stage does not occur during training, so Eq.(5) does not update the model weights.
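
To make the separation concrete, here is a toy sketch of coordinate generation as a post-hoc optimization: the predicted distance map is held fixed and only the compound coordinates are updated by gradient descent, so no model weights are involved. This is not the TankBind implementation. It assumes a single squared-error term between current and predicted protein-compound distances, whereas the full Eq.(5) may contain additional terms; all names, coordinates, and hyperparameters are made up for illustration.

```python
import math

def distance_loss(ligand, protein, d_pred):
    """Sum of squared differences between current and predicted distances."""
    return sum(
        (math.dist(p, x) - d_pred[i][j]) ** 2
        for i, p in enumerate(protein)
        for j, x in enumerate(ligand)
    )

def optimise_coords(ligand, protein, d_pred, lr=0.01, steps=500):
    """Plain gradient descent on the compound coordinates only."""
    ligand = [list(x) for x in ligand]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in ligand]
        for i, p in enumerate(protein):
            for j, x in enumerate(ligand):
                d = math.dist(p, x)
                if d == 0.0:
                    continue  # gradient undefined at zero distance
                coeff = 2.0 * (d - d_pred[i][j]) / d
                for k in range(3):
                    grads[j][k] += coeff * (x[k] - p[k])
        for j in range(len(ligand)):
            for k in range(3):
                ligand[j][k] -= lr * grads[j][k]
    return ligand

# Fixed protein nodes and a hypothetical "true" pose used to fake D^pred.
protein = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0)]
target = [(3.0, 3.0, 3.0), (4.0, 3.0, 3.0)]
d_pred = [[math.dist(p, x) for x in target] for p in protein]  # held fixed

start = [(5.0, 5.0, 5.0), (6.0, 4.0, 4.0)]
initial_loss = distance_loss(start, protein, d_pred)
final = optimise_coords(start, protein, d_pred)
final_loss = distance_loss(final, protein, d_pred)
```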

Could you please explain Eq.(1) and Eq.(2) in more detail?

What do t_ij and t'_ij in Eq.(1) actually mean, and why is a gated linear transformation applied to z_ij?

Why can the self-attention in Eq.(2) model the excluded-volume and saturation effects?

Thanks!

t_ij and t'_ij are basically the z_ij embedding from the previous stack with some transformation applied. You could use any other transformation that you see fit.
Excluded volume means that a protein node A is unlikely to interact with many compound nodes simultaneously, and I believe self-attention on z_ij can learn this.

Hi, I wonder how you choose the ground-truth functional block for a protein-ligand pair. You mentioned in Appendix J that "A protein block encloses the ligand when it covers more than 90% of the native interaction." Since a functional block can cover about 200 amino acids, I assume the functional blocks are constructed with some overlap, right? If so, multiple blocks could satisfy the above condition. I am curious how you deal with such a situation. Thank you!

Yes, a block is treated as a "native block" as long as it covers more than 90% of the native interaction. A protein can have multiple native blocks.
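
A small sketch of how overlapping blocks can each qualify as native under this rule. It is an illustration only: representing a block as a set of protein node indices, the block names, and the contact list are all hypothetical, and how TankBind actually constructs its blocks is not shown in this thread.

```python
def native_blocks(blocks, native_contacts, threshold=0.9):
    """Return the names of blocks covering more than `threshold` of native contacts.

    blocks: dict mapping a block name to a set of protein node indices.
    native_contacts: list of (protein_idx, compound_idx) pairs that are
        within 8 angstroms in the native complex.
    """
    native_num_contact = len(native_contacts)
    selected = []
    for name, nodes in blocks.items():
        # Contacts whose protein node falls inside this block.
        num_contact = sum(1 for p, _ in native_contacts if p in nodes)
        if native_num_contact and num_contact / native_num_contact > threshold:
            selected.append(name)
    return selected

# Ten native contacts made by protein nodes 50..59 (toy data).
native_contacts = [
    (50, 0), (51, 0), (52, 1), (53, 1), (54, 2),
    (55, 2), (56, 3), (57, 3), (58, 4), (59, 4),
]

# Overlapping hypothetical blocks over the same protein.
blocks = {
    "block_a": set(range(0, 58)),     # covers 8 of 10 contacts: not native
    "block_b": set(range(40, 70)),    # covers all 10: native
    "block_c": set(range(45, 100)),   # covers all 10: native
    "block_d": set(range(100, 200)),  # covers none: not native
}

natives = native_blocks(blocks, native_contacts)  # ["block_b", "block_c"]
```

Here two overlapping blocks both pass the 90% threshold, so both are labeled native, which matches the answer above.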