luwei0917/TankBind

Difference in reported experiment results

Opened this issue · 27 comments

There is a difference between the experiment results reported in the paper and the results from the GitHub code.
Are the paper results from TankBind v1.0.1 and the GitHub results from TankBind v0.5.0?

The results from the GitHub model should be about the same as those reported in the preprint; we only slightly updated the coordinate-generation part of the code. v1.0.1 is not released yet.

@luwei0917
The model outputs the same affinity value for different structures of the same ligand: the RDKit structure, the PDBbind structure, and the structures predicted by TankBind.
Is this normal?

Yes. In my opinion, that's the beauty of TankBind: you don't need prior knowledge of the complex structure to predict the affinity.

@luwei0917
That sounds awesome.
I am curious how TankBind achieves a conformation-agnostic affinity prediction, i.e., how the input ligand coordinates can have no influence on the output affinity.
And if that is so, how can TankBind update the ligand coordinates to improve the docked pose?

The coordinates of the ligand are determined by its interaction with the protein and by the constraints imposed by chemical bonds and excluded-volume effects.
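For intuition, here is a minimal sketch of that idea: given a predicted ligand-protein distance map and LAS-constrained intra-ligand distances, ligand coordinates can be recovered by gradient descent. All names below are illustrative, not TankBind's actual API.

import torch

def optimize_ligand_coords(pred_inter_dist, protein_coords,
                           pred_intra_dist, las_mask,
                           n_steps=500, lr=0.1):
    """Toy coordinate generation (illustrative, not TankBind's code):
    find ligand coordinates whose pairwise distances match
    (a) the predicted ligand-protein distance map and
    (b) the LAS-masked intra-ligand distances."""
    n_lig = pred_inter_dist.shape[0]
    coords = torch.randn(n_lig, 3, requires_grad=True)  # random initialization
    opt = torch.optim.Adam([coords], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        d_inter = torch.cdist(coords, protein_coords)  # ligand-protein distances
        d_intra = torch.cdist(coords, coords)          # intra-ligand distances
        loss = ((d_inter - pred_inter_dist) ** 2).mean() \
             + ((d_intra - pred_intra_dist)[las_mask] ** 2).mean()
        loss.backward()
        opt.step()
    return coords.detach()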

@luwei0917

  1. Predicting coordinates:
     Is D^c_{j,k} in Equation (5) the ground-truth distance from PDBbind?
     If so, is the label leaked?
  2. Predicting affinity:
     Is D_{ij} in L_distance around Equation (4) the ground-truth distance from PDBbind?
     If so, no matter what conformation is input to the model, can the best conformation always be
     obtained, so that the predicted affinity is always based on the best conformation?

  1. It is based on the RDKit-generated conformation, not the PDBbind one, when testing or inferencing; it is based on the PDBbind conformation during training.
     A side note: even during training, under the self-dock setting, we also impose a LAS mask on D^c_{j,k}.
  2. This is the loss computed for training the model: we want our predicted D_{ij} to be as close to the ground truth as possible.
     Affinity prediction is always based on the predicted interaction embedding.

The conformation of the ligand is predicted, not given.
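For concreteness, a toy version of such a distance loss might look like the following; the cutoff value and the use of plain MSE are assumptions, not the paper's exact L_distance.

import torch

def distance_loss(pred_dist, true_dist, cutoff=10.0):
    """MSE between predicted and ground-truth ligand-protein distances,
    restricted to pairs within a cutoff (distant pairs say little about the pose)."""
    mask = true_dist < cutoff
    return ((pred_dist - true_dist)[mask] ** 2).mean()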

@luwei0917

  1. "It is based on the RDKit generated conformation when testing or inferencing." This will make the conformation closer to RDKit, i.e., farther to PDB conformation?

In self dock the LAS mask is applied, so only the basic chemical constraints are imposed; the overall conformation is determined by the interaction.
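For intuition, a rough sketch of what such a LAS mask could look like, based on my reading of the paper rather than the repository's exact code: pairs whose distance is fixed by a bond, a bond angle, or membership in the same small ring are masked, and everything else is left to the predicted interactions.

import numpy as np
from rdkit import Chem

def las_mask(mol):
    """Boolean (n_atoms, n_atoms) mask, True where the pairwise distance is
    (approximately) fixed by local chemistry (illustrative sketch)."""
    n = mol.GetNumAtoms()
    adj = Chem.GetAdjacencyMatrix(mol)   # integer adjacency matrix
    mask = np.eye(n, dtype=bool)
    mask |= adj > 0                      # 1-2 pairs: directly bonded
    mask |= (adj @ adj) > 0              # 1-3 pairs: fixed by bond angles
    for ring in Chem.GetSymmSSSR(mol):   # small rings (not always rigid!)
        idx = list(ring)
        for i in idx:
            for j in idx:
                mask[i, j] = True
    return mask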

@luwei0917
Is TankBind SE(3)-equivariant?

Yes, I think so: relative distances are SE(3)-invariant, so the predictions do not change under rigid motions of the input.
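A quick numeric check of that claim (assuming numpy and scipy are available):

import numpy as np
from scipy.spatial.transform import Rotation

coords = np.random.rand(10, 3)                   # arbitrary point cloud
R = Rotation.random().as_matrix()                # random rotation
t = np.random.rand(3)                            # random translation
moved = coords @ R.T + t
dist = lambda x: np.linalg.norm(x[:, None] - x[None, :], axis=-1)
print(np.abs(dist(coords) - dist(moved)).max())  # ~1e-15: distances are unchanged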


LAS includes rings, and the prediction example in your code contains an N-O-C ring; however, that ring is not rigid, so shouldn't it be excluded from the LAS mask?

Are the protein's GVP embeddings and the ligand's TorchDrug features invariant or equivariant?

I used "Chem.GetSymmSSSR" function in RDKit to get the ring. Could be that some ring should be rigid but considered flexible, or other way around. If you have better solution, please let me know.

I used "Chem.GetSymmSSSR" function in RDKit to get the ring. Could be that some ring should be rigid but considered flexible, or other way around. If you have better solution, please let me know.

Chem.GetSymmSSSR returns the symmetrized smallest set of smallest rings, which are not necessarily rigid. So could we keep only those "Chem.GetSymmSSSR" rings that have a single RDKit conformation?

Right, but how could I know whether a ring has only a single RDKit conformation?

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def generate_conformation(mol):
    """Embed one 3D conformer with ETKDGv2 and relax it with MMFF;
    fall back to 2D coordinates if embedding or optimization fails."""
    mol = Chem.AddHs(mol)
    ps = AllChem.ETKDGv2()
    try:
        # EmbedMolecule returns -1 on failure; the optimizer then raises
        AllChem.EmbedMolecule(mol, ps)
        AllChem.MMFFOptimizeMolecule(mol, maxIters=500, confId=0)
    except Exception:
        AllChem.Compute2DCoords(mol)
    mol = Chem.RemoveHs(mol)
    return mol

def generate_coords(smiles='c1ccccc1'):
    """Return the 3D coordinates of one RDKit-generated conformer."""
    mol = Chem.MolFromSmiles(smiles)
    mol = generate_conformation(mol)
    return mol.GetConformer().GetPositions()
    
def rigid_transform_3d(A, B):
    """Find the rotation matrix R (3x3) and translation vector t that best
    align point set A onto point set B (Kabsch algorithm).
    Align A to B with: np.matmul(A, R.T) + t.reshape([1, 3])
    """
    assert len(A) == len(B), "Point sets must have the same length"
    centroid_A = np.mean(A, axis=0)
    centroid_B = np.mean(B, axis=0)
    # centre both point sets
    AA = A - centroid_A
    BB = B - centroid_B
    H = np.matmul(AA.T, BB)  # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    R = np.matmul(Vt.T, U.T)
    # reflection case: flip the sign of the last singular vector
    if np.linalg.det(R) < 0:
        Vt[2, :] *= -1
        R = np.matmul(Vt.T, U.T)
    t = -np.matmul(R, centroid_A) + centroid_B
    return R, t
    
def get_err(smiles='c1ccccc1'):
    """Embed the molecule twice and return the mean absolute coordinate
    error after optimally superimposing the two conformers."""
    A, B = generate_coords(smiles), generate_coords(smiles)
    R, t = rigid_transform_3d(A, B)
    err = B - np.matmul(A, R.T) - t.reshape([1, 3])
    return np.mean(np.abs(err))

def rigid_ring(smiles='c1ccccc1'):
    """Treat a ring as rigid if two independent embeddings agree up to a
    rigid transform, e.g.:
      get_err('c1ccccc1')  # ~1e-6 (benzene: rigid)
      get_err('C1CNCOC1')  # ~0.158 (flexible)
    """
    return get_err(smiles) <= 1e-3
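Building on that, one could filter the GetSymmSSSR rings through the rigid_ring test before adding them to the LAS mask. This combination is only a suggestion; the fragment SMILES of a bare ring may occasionally fail to parse.

def rigid_rings_only(mol):
    """Return atom-index lists for the GetSymmSSSR rings whose isolated
    fragment passes the rigid_ring conformer test above (illustrative)."""
    keep = []
    for ring in Chem.GetSymmSSSR(mol):
        idx = list(ring)
        frag = Chem.MolFragmentToSmiles(mol, atomsToUse=idx)  # crude rigidity probe
        if Chem.MolFromSmiles(frag) is not None and rigid_ring(frag):
            keep.append(idx)
    return keep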


The checkpoint and prediction code cannot reproduce the results in the paper, especially the 'mean' column in Table 1. Have you tested this before? Any suggestions?
@luwei0917

Thanks for the "rigid_ring" function.
The numbers should be about the same; we tested this before.
I guess it could be an atom-index misalignment: the TorchDrug input is the SMILES, so the atom indices might differ from those in the input SDF.
I used the following script to re-order the atoms:

# writing the canonical SMILES records the atom output order on the mol
sm = Chem.MolToSmiles(mol)
# '_smilesAtomOutputOrder' maps positions in the SMILES back to mol atom indices
m_order = list(mol.GetPropsAsDict(includePrivate=True, includeComputed=True)['_smilesAtomOutputOrder'])
mol = Chem.RenumberAtoms(mol, m_order)

If this is not the problem, can you send me your folder/scripts to wei.lu@galixir.com and I will take a look.
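To check that the re-ordering worked, one can compare atom symbols after renumbering against a fresh parse of the SMILES; the file name here is hypothetical.

from rdkit import Chem

mol = Chem.MolFromMolFile("ligand.sdf")  # hypothetical input SDF
sm = Chem.MolToSmiles(mol)
order = list(mol.GetPropsAsDict(includePrivate=True,
                                includeComputed=True)['_smilesAtomOutputOrder'])
mol = Chem.RenumberAtoms(mol, order)
ref = Chem.MolFromSmiles(sm)
# heavy atoms should now line up index-by-index with the SMILES parse
assert [a.GetSymbol() for a in mol.GetAtoms()] == [a.GetSymbol() for a in ref.GetAtoms()]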

I have solved the re-ordering problem, but still can't reproduce the results. An email with the code and the reproduced results has been sent to your address.

Hope to get your suggestion.


Thanks for your suggestions by email. There are still some diffs in the newly reproduced results; the details have been sent to your email address.

Hope to get your reply.

I am hitting an analogous issue.
There is a gap between the results we reproduced and those reported in the paper, especially for the RMSD metric.
@Anfankus @yufengwhy Were you able to reproduce the results later?

@Yuki0613 I uploaded the notebook for evaluating the test set.

I think the main difference is the data processing, because even when I use my own processed data, the performance diff remains. @Yuki0613
Can the code that processes PDBbind into the test data be provided? @luwei0917

  1. I used the data-processing code in this code base but got 3265 pocket samples in the test set; your notebook, however, shows 2879 pocket samples.
  2. What does 'renumber_atom_index_same_as_smiles' in the notebook do?

I used the top-10 p2rank pockets; you probably used all pockets.
The "renumber_atom_index_same_as_smiles" folder is created by cell 10 of
https://github.com/luwei0917/TankBind/blob/ed4c87f47b8268e7ee4f4dc824656fdfe6883593/examples/construction_PDBbind_training_and_test_dataset.ipynb

Can the training code be updated?
For example, in data.py many of the .pt file names are mismatched with those in the construction_dataset notebook.
Moreover, there are two compound_dicts, new_dataset.compound_dict and all_pocket_test.compound_dict, but there is only one compound_dict in the construction_dataset notebook?
@luwei0917


I think new_dataset.compound_dict is built by the construction_PDBbind_training_and_test_dataset.ipynb notebook and saved as compound_torchdrug_features.pt. The other dict, all_pocket_test.compound_dict (its two occurrences should be the same), is built by the testset_evaluation_cleaned.ipynb notebook and saved as pdbbind_test_compound_dict_based_on_rdkit.pt.
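To see which dict is which, one can load both files and compare their keys (file names are those given above; I assume the keys are PDB codes):

import torch

feats = torch.load("compound_torchdrug_features.pt")               # new_dataset.compound_dict
test = torch.load("pdbbind_test_compound_dict_based_on_rdkit.pt")  # all_pocket_test.compound_dict
print(len(feats), len(test))
print(set(test) - set(feats))  # test-set entries missing from the features dict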