lhatsk/AlphaLink

Problem with Crosslinking data input

Li-dacheng opened this issue · 12 comments

While reproducing the CDK results from the test_set, I noticed that you provide the crosslink input data in both CSV and PT file formats. In the PT file, the xl_array contains a duplicated entry for each residueFrom/residueTo pair, in reverse order. Can you explain why these entries are duplicated?
Additionally, could you clarify what information the grouping_array represents?
Furthermore, the results I infer from these inputs do not match the PDB file located at test_set/CDK/predictions/CDK_neff10_1h01_xl_model_5_ptm.pdb, in terms of both RMSD and TM-score.

This is my call script:
python predict_with_crosslinks.py test_set/CDK/fasta/CDK.fasta test_set/CDK/crosslinks/1h01_xl.pt --features test_set/CDK/features/CDK_neff10.pkl --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt --uniref90_database_path /xxx/uniref90.fasta --mgnify_database_path /xxx/mgnify/mgy_clusters_2022_05.fa --pdb70_database_path /xxx/pdb70 --uniclust30_database_path /xxx/uniref30/

lhatsk commented

The xl array (like a contact map or the pair representation) is symmetric; that's why you have both (i,j) and (j,i). The grouping array is an artefact; it's no longer required in the distogram network. Here, we just assign every crosslink to its own group, indicated by an integer.
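
As a minimal sketch of the symmetry (the exact .pt schema, e.g. any additional columns such as an FDR value, may differ from the shipped files), building such an array could look like this:

    # Build a symmetric crosslink array from (residueFrom, residueTo) pairs.
    # Anything beyond the xl_array/grouping_array keys is an assumption here.
    import torch

    pairs = [(5, 112), (40, 87)]  # hypothetical 1-based residue pairs

    rows, groups = [], []
    for group_id, (i, j) in enumerate(pairs):
        rows.append((i, j))   # the crosslink ...
        rows.append((j, i))   # ... and its mirror, since the map is symmetric
        groups += [group_id, group_id]  # both directions share one integer group

    xl_array = torch.tensor(rows, dtype=torch.long)
    grouping_array = torch.tensor(groups, dtype=torch.long)
    torch.save({"xl_array": xl_array, "grouping_array": grouping_array}, "xl.pt")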

To reproduce the results, you need to disable all sources of non-determinism, for example, the MSA masking.

Thank you for your response. Based on your explanation, am I correct in understanding that grouping_array doesn't serve any purpose in the model?
Shouldn't using the crosslink data in both PT and CSV formats yield the same results?
Also, the CDK_neff10.pkl file contains MSA features, but since the inference process didn't utilize MSA, it shouldn't affect my reproduction efforts, correct?

I noticed you mentioned the example T1064 in another issue (issues/13). I ran the data as per your instructions, but the resulting pLDDT score doesn't match the 82.371 displayed in that link. Additionally, it differs significantly from the TM-score in the attached model.cif file. Could you please help me identify the issue?
Below are my input commands and output files.
python predict_with_crosslinks.py \
    T1064.fasta \
    T1064_8_LEU_10A_CA.pt \
    --features T1064.pkl \
    --checkpoint_path /AlphaLink/resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt \
    --output_dir $output_dir

lhatsk commented

Based on your explanation, am I correct in understanding that grouping_array doesn't serve any purpose in the model? Shouldn't using the crosslink data in both PT and CSV formats yield the same results?

It doesn't serve any purpose, but unfortunately it can still affect the results because it injects randomness.

Also, the CDK_neff10.pkl file contains MSA features, but since the inference process didn't utilize MSA, it shouldn't affect my reproduction efforts, correct?

What do you mean, it didn't utilize the MSA? For this example, there will not be any random subsampling of the MSAs, since the MSA size is below the threshold, but by default, there is always MSA masking. This would also apply to T1064; you'd need to remove any source of randomness, including MSA masking. We removed any non-determinism to make the results comparable to AlphaFold.

What do you mean, it didn't utilize the MSA?

I noticed in predict_with_crosslinks.py that if a PKL file is provided, the MSA search won't be performed since the PKL file already contains the MSA information. Is that correct?

This would also apply to T1064; you'd need to remove any source of randomness, including MSA masking.

For this example, there will not be any random subsampling of the MSAs, since the MSA size is below the threshold, but by default, there is always MSA masking.

So, when you refer to the random subsampling of the MSAs, what does that mean? Do I need to input the neff parameter? How do I remove MSA masking? Can you give an example?

We removed any non-determinism to make the results comparable to AlphaFold.

By the way, you trained on model_5_ptm. When comparing with AlphaFold, did you use the results from model_5? Which checkpoint did you use, the one from AlphaFold or OpenFold?

Thank you very much for your patient responses. Looking forward to your reply.

lhatsk commented

What do you mean, it didn't utilize the MSA?

I noticed in predict_with_crosslinks.py that if a PKL file is provided, the MSA search won't be performed since the PKL file already contains the MSA information. Is that correct?

Yes, no MSA search will be performed if you supply a pickle file. The pickle already contains all the features, including the MSA. This way the MSA stays fixed (at a given Neff), which ensures comparability with AlphaFold, since we used exactly the same input features. The only difference is the crosslinks (+ additional training).
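
If you want to double-check what a feature pickle carries, a quick inspection works (key names follow the standard AlphaFold/OpenFold feature dictionary; treat them as assumptions if your pickle was produced differently):

    # Inspect a precomputed feature pickle; the MSA is already baked in.
    import pickle

    with open("test_set/CDK/features/CDK_neff10.pkl", "rb") as f:
        feats = pickle.load(f)

    print(sorted(feats.keys()))  # e.g. 'aatype', 'msa', 'num_alignments', ...
    print(feats["msa"].shape)    # (num_alignments, num_res)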

So, when you refer to the random subsampling of the MSAs, what does that mean? Do I need to input the neff parameter?

To limit memory consumption, AlphaFold limits the size of the input MSAs. How many sequences are used is defined in the model configuration; see https://github.com/lhatsk/AlphaLink/blob/main/openfold/config_crosslinks.py#L197.

If the MSA is bigger than max_msa_clusters, it is subsampled to max_msa_clusters sequences, and the rest is aggregated in the extra MSA stack.
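
As an illustration only (the real logic lives in OpenFold's data transforms), the cluster/extra split amounts to something like:

    # Sketch of the max_msa_clusters split: keep the query plus a random
    # subset of sequences; everything else goes to the extra MSA stack.
    import numpy as np

    def split_msa(msa, max_msa_clusters, rng):
        n = msa.shape[0]
        if n <= max_msa_clusters:
            return msa, msa[:0]  # everything fits; extra stack is empty
        keep = rng.choice(np.arange(1, n), size=max_msa_clusters - 1, replace=False)
        keep = np.concatenate(([0], np.sort(keep)))  # always keep the query (row 0)
        extra = np.setdiff1d(np.arange(n), keep)
        return msa[keep], msa[extra]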

How do I remove MSA masking? Can you give an example?

https://github.com/lhatsk/AlphaLink/blob/main/openfold/config_crosslinks.py#L196

Set this to 0.0.
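
The edit amounts to changing one value (shown as an excerpt; the exact surrounding structure of config_crosslinks.py is assumed here, and the line number in the link above may drift):

    # openfold/config_crosslinks.py, in the data config:
    "masked_msa_replace_fraction": 0.0,  # default 0.15 masks 15% of the MSA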

By the way, you trained on model_5_ptm. When comparing with AlphaFold, did you use the results from model_5? Which checkpoint did you use, the one from AlphaFold or OpenFold?

We used the AlphaFold 2.0 weights for model_5_ptm, both as the starting point for fine-tuning and for the AlphaFold predictions. The predictions were made in OpenFold with the AlphaFold weights, which produces the same (or reasonably close) results as AlphaFold.

Yes, no MSA search will be performed if you supply a pickle file. The pickle already contains all the features, including the MSA. This way the MSA stays fixed (at a given Neff), which ensures comparability with AlphaFold, since we used exactly the same input features. The only difference is the crosslinks (+ additional training).

Sorry, OpenFold cannot accept a feature file as input, right? So how do you ensure that you are using exactly the same input? When creating feature files, you mentioned using different 'neff' values. How is this variable controlled when comparing with AlphaFold2?

How do I remove MSA masking? Can you give an example?

https://github.com/lhatsk/AlphaLink/blob/main/openfold/config_crosslinks.py#L196
Set this to 0.0.

Thank you very much for your prompt reply. After setting this config value to 0.0, the TM-score of the AlphaLink inference increased from 0.365 to 0.8675. Could you please explain why this has such a significant impact?
Do we always need to set masked_msa_replace_fraction to 0.0 when using AlphaLink?
And, when comparing with AlphaFold2, do I also need to set masked_msa_replace_fraction in the OpenFold config to 0.0?

lhatsk commented

Sorry, OpenFold cannot accept a feature file as input, right? So how do you ensure that you are using exactly the same input?

No, not by default, but it's easy to change. I just removed crosslinks from AlphaLink and used the original AlphaFold weights with the same inputs.

When creating feature files, you mentioned using different 'neff' values. How is this variable controlled when comparing with AlphaFold2?

By using the same features, which include the MSA with a fixed Neff.

Thank you very much for your prompt reply. After setting this config value to 0.0, the TM-score of the AlphaLink inference increased from 0.365 to 0.8675. Could you please explain why this has such a significant impact?

The MSA masking affects the Neff: it randomly removes 15% of the information in the MSA. The effect is obviously much stronger for MSAs that contain little information to begin with (low Neff). Depending on what is masked and how well the network is able to reconstruct it, you may end up with a lower or higher effective Neff than before. It could, for example, mask out parts that help with noise rejection, or remove information that is highly complementary to the crosslinks, resulting in worse results and more variance. Here, the masking was just unlucky; it could also have helped.
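
Conceptually, the masking does something like the following (a simplified sketch; AlphaFold's actual masked-MSA corruption also mixes in random, profile-sampled, and unchanged tokens rather than a single mask token):

    # Sketch of MSA masking: corrupt a fraction of MSA entries at random.
    import numpy as np

    def mask_msa(msa, replace_fraction, rng):
        corrupted = msa.copy()
        mask = rng.random(msa.shape) < replace_fraction
        corrupted[mask] = 21  # hypothetical [MASK] token id
        return corrupted, mask

    rng = np.random.default_rng(0)
    msa = rng.integers(0, 20, size=(8, 50))  # toy MSA: 8 sequences, 50 residues
    corrupted, mask = mask_msa(msa, 0.15, rng)
    print(mask.mean())                       # ~0.15 of entries corrupted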

Do we always need to set masked_msa_replace_fraction to 0.0 when using AlphaLink?

No, I would keep it on for normal usage.

And, when comparing with AlphaFold2, do I also need to set masked_msa_replace_fraction in the OpenFold config to 0.0?

Yes, you should set it to 0.0 to keep the comparison fair for both methods.

By using the same features, which include the MSA with a fixed Neff.

Thank you. I would like to know how the number of effective sequences (Neff) is defined. Did you set the parameter neff=10 when running AlphaLink and AlphaFold2 on the dataset? Is this done to reflect the impact of crosslink data?

I ask because, when I ran the MSA search with neff=10 on the example 6LKI_B (ma-rap-alink-0001), the results differ from those obtained with the feature inputs (skipping the MSA search). The TM-scores against the ground truth are 0.8087 and 0.9012, respectively.

lhatsk commented

Thank you. I would like to know how the number of effective sequences (Neff) is defined. Did you set the parameter neff=10 when running AlphaLink and AlphaFold2 on the dataset? Is this done to reflect the impact of crosslink data?

The Neff is defined in the "MSA subsampling" section. We subsampled the MSAs to a given Neff to simulate challenging targets and show the impact of crosslinking MS data.
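
For reference, a common way to compute Neff (a sketch; the repository's own get_eff in openfold/data/msa_subsampling.py may use a different identity threshold or weighting) is to down-weight each sequence by the number of its neighbours within a sequence-identity cutoff:

    # Neff = sum over sequences of 1 / (number of sequences within the
    # identity cutoff, including itself). Quadratic in MSA size; demo only.
    import numpy as np

    def neff(msa, identity_cutoff=0.8):
        identities = (msa[:, None, :] == msa[None, :, :]).mean(-1)  # pairwise identity
        neighbours = (identities >= identity_cutoff).sum(-1)        # counts include self
        return float((1.0 / neighbours).sum())

    rng = np.random.default_rng(0)
    msa = rng.integers(0, 20, size=(32, 64))  # toy integer-encoded MSA
    print(neff(msa))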

I ask because, when I ran the MSA search with neff=10 on the example 6LKI_B (ma-rap-alink-0001), the results differ from those obtained with the feature inputs (skipping the MSA search). The TM-scores against the ground truth are 0.8087 and 0.9012, respectively.

6LKI is part of the low-Neff CAMEO targets; they are already challenging with low Neffs (at most 25; for 6LKI it's 15), so we didn't do any MSA subsampling. Your subsampling will further reduce the Neff and make the target harder, which likely results in a lower TM-score.

Hello, I noticed in the data_module_xl.py file, specifically at line 24, that you import the MSA subsampling functions with from openfold.data.msa_subsampling import get_eff, subsample_msa, subsample_msa_sequentially, subsample_msa_random. However, looking further into the file, I didn't find any usage of these functions. Could you please explain why they are imported but not used?

data_module_xl.py is not used. It's some legacy stuff that I didn't clean up.