lhatsk/AlphaLink

hands-on protocol for contacts_to_distograms

Yan-Yan-2020 opened this issue · 34 comments

Hi,

Could you please share a hands-on protocol on how we can generate a distogram with contact information? As a beginner, it seems hard for me to use the script (contacts_to_distograms.py) to build the distogram.

Thank you so much!
Yan

Hi Yan,

To generate distograms with the contacts_to_distograms.py script you need two inputs.

  1. a CSV with your contacts where each row has the following format: ResiduePosition1 ResiduePosition2 FDR
    ResiduePositions are 1-indexed, i.e., starting from 1.
    FDR corresponds to the uncertainty for this particular contact, essentially the level of noise you expect. It's a number between 0 and 1. E.g., 0.05 would mean there is a 5% chance this contact is noisy/wrong. In principle, you can use an FDR of 0, but this would be more akin to forcing constraints.
    Example:
    10 15 0.05
    30 50 0.2
  2. the cutoff for your contact information in Angstrom (CA-CA).

The script currently only allows one cutoff. If you want to mix multiple cutoffs (which is possible), a quick hack would be to run this script with different cutoffs and concatenate the output files.
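
The concatenation hack can be sketched in a few lines of Python (the file names here are hypothetical examples, not outputs the script produces by default):

```python
def concatenate_csv(inputs, output):
    """Append the rows of several contacts_to_distograms.py output files
    (one per cutoff) into a single combined distogram CSV."""
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as f:
                for line in f:
                    out.write(line)

# e.g. after running the script twice with --cutoff 10 and --cutoff 25:
# concatenate_csv(["distograms_10A.csv", "distograms_25A.csv"], "distograms_mixed.csv")
```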

The output is another CSV with the distograms per contact pair that you can use for prediction.

Run like this:
python contacts_to_distograms.py --csv my_contact_file --cutoff 10 --output my_distogram_output

Let me know how it's working out with your data, I'm curious!

Kolja

Hi Kolja,

I just tried the contacts_to_distograms.py script. It showed the following error:

22 def get_uniform(cutoff, fdr):
23 d = np.ones(128)
---> 24 maximum_bin = np.argmax(distogram_bins > cutoff)
25 d[:maximum_bin] /= np.sum(d[:maximum_bin])
26 d[:maximum_bin] *= 1 - fdr
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'

do you know how to fix it?

Thank you.
Yan

Sorry! It's fixed in the new version.
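
For anyone who hits the same TypeError on an older checkout: the message ('>' not supported between 'numpy.ndarray' and 'str') indicates that the cutoff reached the comparison as a string. A minimal sketch of that failure mode and the likely kind of fix (the argparse setup and bin layout here are assumptions for illustration, not the actual script):

```python
import argparse

import numpy as np

parser = argparse.ArgumentParser()
# type=float is the key: without it, argparse hands back the string "10",
# and `distogram_bins > cutoff` raises the TypeError above.
parser.add_argument("--cutoff", type=float, default=10.0)
args = parser.parse_args([])

distogram_bins = np.linspace(2.0, 42.0, 128)  # hypothetical bin centers
maximum_bin = np.argmax(distogram_bins > args.cutoff)  # works: float vs. array
```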

Hi Kolja,
The new script works now.
Thank you so much!
Yan

Hi Kolja,
I have a new problem with the 'preprocessing_distributions.py' script. What is the format of the infile "restraints.csv"? Could you please give an example of the csv file?

Thank you so much!
Yan

Specifically, it gave the error:

cannot parse restraint type in line
('From', 'To', 'mu', 'sigma', 'type')

Thank you.
Yan

Hello,
if you run

preprocessing_distributions.py --help

it should give you the format of restraints.csv (I'll fix the way it's displayed and include an example and clarification in the README)

The file should be made up of lines like (for example)

125,51,10.0,5.0,normal
16,10,10.0,5.0,log-normal

These lines specify

a restraint from resi 125 to resi 51 with a mean distance of 10 Angstrom, standard deviation of 5 and a normal distribution
a restraint from resi 16 to resi 10 with a mean distance of 10, standard deviation of 5 and a log-normal distribution

Yes, I did run preprocessing_distributions.py --help. It showed
"residueFrom, residueTo, mean_dist, stdev, distribution_type"

and I changed them to ('From', 'To', 'mu', 'sigma', 'type') in your original script, but it still does not work and shows: cannot parse restraint type in line
('From', 'To', 'mu', 'sigma', 'type')

could you attach your restraint.csv file here?

(there should be no header in the file, just the comma-separated list of restraints with the lines like my comment above)
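
A minimal parser for this headerless format (a sketch based on the examples above; the accepted type strings are assumed to be exactly "normal" and "log-normal"):

```python
def parse_restraints(path):
    """Parse headerless restraint lines: from,to,mu,sigma,type."""
    restraints = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            resi_from, resi_to, mu, sigma, rtype = line.strip().split(",")
            if rtype not in ("normal", "log-normal"):
                # a header row like From,To,mu,sigma,type fails here, matching
                # the "cannot parse restraint type" error seen above
                raise ValueError(f"cannot parse restraint type in line {line!r}")
            restraints.append(
                (int(resi_from), int(resi_to), float(mu), float(sigma), rtype)
            )
    return restraints
```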

It works now without headers. Thank you so much!

Hi,
Again, now I have questions on the predict_with_crosslinks.py script. In the README protocol, the command shows
'python predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08'

I dissected it the following way:

python predict_with_crosslinks.py
--distograms restraints_distribution.csv (# CSV file generated by the preprocessing_distributions.py script)
--checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt
? 7K3N_A.fasta (# I guess this is the input file? Which script should we run?)
? restraints.csv (# why is there a restraints.csv file here? Which py script should we run?)
--uniref90_database_path uniref90.fasta
--mgnify_database_path mgy_clusters.fa
--pdb70_database_path pdb70/pdb70
? pdb_mmcif/mmcif_files (# which script should we run? like the openfold command format?)
--uniclust30_database_path uniclust30_2018_08/uniclust30_2018_08

I left questions on each line of the command.

Thank you so much!
Yan

Hi Yan,

--distograms is only a flag, it doesn't take an argument.
restraints.csv then corresponds to your restraints_distribution.csv.
7K3N_A.fasta -- you need to have some sort of FASTA input file with your sequence.
pdb_mmcif/mmcif_files -- yes, follow the OpenFold instructions, also for setting up the databases like uniref90, mgnify, and pdb70.

Kolja

Thank you so much Kolja.

Hi, I got another error, as follows:

"RuntimeError: Error(s) in loading state_dict for AlphaFold:
size mismatch for xl_embedder.linear.weight: copying a param with shape torch.Size([128, 1]) from checkpoint, the shape in current model is torch.Size([128, 128])"

Is there an easy way to fix it?
Thanks
Yan

Hi,
never mind my last question; I made it work with distogram.pt.

but it showed
' File "/opt/AlphaLink/openfold/np/relax/amber_minimize.py", line 521, in run_pipeline
max_attempts=max_attempts,
File "/opt/AlphaLink/openfold/np/relax/amber_minimize.py", line 459, in _run_one_iteration
raise ValueError(f"Minimization failed after {max_attempts} attempts.")
ValueError: Minimization failed after 100 attempts."

and the prediction gave only one structure model. Does that make sense? How can I fix it?

Cheers,
Yan

In the end, you get two models, one relaxed and one unrelaxed. It fails in your case in the relax stage. I haven't run into this issue before. It's probably due to a lot of clashes. You should have a look at the unrelaxed model to see if it makes sense.

Thank you so much Kolja.

Is there a simple way to run multiple different sequences and restraints data at the same time?

Thanks,
Yan

No, not in parallel. Sequentially, the easiest would be to wrap everything in a for-loop in bash.
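
The bash for-loop can equally be sketched in Python (the input pairs and file names here are hypothetical; the flags follow the command from earlier in this thread, with the distogram checkpoint since --distograms is set):

```python
import subprocess

# Hypothetical (fasta, restraints) input pairs; one prediction per pair.
jobs = [
    ("7K3N_A.fasta", "7K3N_A_restraints.csv"),
    ("other.fasta", "other_restraints.csv"),
]

def build_command(fasta, restraints):
    """Assemble one predict_with_crosslinks.py call."""
    return [
        "python", "predict_with_crosslinks.py",
        "--distograms",
        "--checkpoint_path",
        "resources/AlphaLink_params/finetuning_model_5_ptm_distogram.pt",
        fasta, restraints,
        "uniref90.fasta", "mgy_clusters.fa", "pdb70/pdb70",
        "pdb_mmcif/mmcif_files", "uniclust30_2018_08/uniclust30_2018_08",
    ]

# Run sequentially, aborting on the first failure:
# for fasta, restraints in jobs:
#     subprocess.run(build_command(fasta, restraints), check=True)
```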

Okay, thanks.

Hello, This thread helped me to understand the scripts. Is there a sample input for testing:

  • preprocessing_distributions.py - e.g. the csv described here #9 (comment)
  • along with sample input for these predict_with_crosslinks.py

Currently I am working out some issues to build the dockerfile (which is failing on my linux instance, but I should be able to work through). Afterwards, I would like to test with existing input. It appears there is testing data (test_set), but this may be already processed to some degree?

Thanks!

Hello,

Just to clarify: AlphaLink is run with a protein sequence (.fasta format) and a set of distance restraints (usually crosslinking MS data). The restraints can be represented in 2 ways:

- the default representation, which is a space-separated file with ResidueFrom ResidueTo FDR as shown in the README. In this case, predict_with_crosslinks.py is run with no additional flags. The same information can also be given as a PyTorch dictionary of NumPy arrays.
- as distance distributions. In this case, you may either provide the program with your own distance distributions, or generate some for your restraints using preprocessing_distributions.py. You then use the output of this script as the input for predict_with_crosslinks.py, which you run with the --distograms flag.

In addition to these 2 files, you may or may not want to use OpenFold to generate the MSA features. For example, you can run the MSA stage of AlphaFold2 and then point to the resulting MSA directories with the --use_precomputed_alignments flag.

In the test directory, you will find the CDK case: there you have PyTorch dictionaries, sequences, and precomputed MSA files to run predict_with_crosslinks.py. If you want, we can also add the space-separated restraint file instead (it looks very much like the one in the README).

Hope this helps!

Thanks, it helps. It would be helpful to have the space-separated restraint file for the CDK example (this would be the data informed by crosslinking MS data, correct?). And if I understand correctly, then I do not need preprocessing_distributions and may follow the 'default representation'? Having those data in text form rather than as a PyTorch dictionary would be helpful, as I am not (yet?) very familiar with the representation and our own data is closer to the text form.

I uploaded the corresponding CSV files. Note that CDK was a theoretical experiment (proof of concept); the links correspond to simulated data. The real data for the membrane set is also in the git (still in PyTorch format, though).

Yes, for the CDK example you can use the CSV directly as an input to the photo-AA network.

For your data, you could do the same if your data is close to 10A. If you need a different cutoff, you would need to use the distogram network. Here you could use contacts_to_distograms.py to preprocess your CSV data (same format as the input to the network, the "default representation").

Hi,

Could you share some detailed commands to play with the --neff flag? Should we run it with the predict_with_crosslinks.py script?

predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08 --neff

Thanks.
Yan

Neff is a number: the number of effective sequences in the MSA, as described in the AlphaLink paper, the original AlphaFold2 paper (Fig. 5), and previous publications. It acts at the MSA level by subsampling the MSA to a given number of effective sequences.

The fewer effective sequences, the weaker the MSA evidence will be. Thus, for Neff=10, in your command:

predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08 --neff 10
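
For intuition, one common definition of Neff counts each sequence down-weighted by its number of close neighbors in the MSA (the 80% identity threshold below is a conventional choice for illustration, not necessarily what AlphaLink uses internally):

```python
import numpy as np

def compute_neff(msa, threshold=0.8):
    """Neff = sum over sequences of 1 / (number of sequences with
    pairwise identity >= threshold, counting the sequence itself)."""
    seqs = np.asarray([list(s) for s in msa])
    n = len(seqs)
    identity = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            identity[i, j] = np.mean(seqs[i] == seqs[j])
    neighbors = (identity >= threshold).sum(axis=1)
    return float(np.sum(1.0 / neighbors))
```

Subsampling to --neff 10 then means dropping sequences until this number is approximately 10, weakening the MSA evidence.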

Hi,
I tried the command and it gave the error:

File "/opt/AlphaLink/predict_with_crosslinks.py", line 569, in
main(args)
File "/opt/AlphaLink/predict_with_crosslinks.py", line 421, in main
feature_dict['deletion_matrix'] = feature_dict['deletion_matrix'][indices]
KeyError: 'deletion_matrix'

Any solution?

Thanks,

I pushed a fix. Thanks for reporting the issue!

Hi,
is there a positional argument 'pdb_mmcif/mmcif_files' in the predict_with_crosslinks.py script? It seems I cannot add template_mmcif_dir.

Thanks,
Yan

lhatsk commented

AlphaLink was trained on model_5 and doesn't support templates. I updated the README with the db flags since they were missing.

Hi,

I am quite intrigued by the approach you used for processing input data when training on distograms. Specifically, for static structures, the distance for a selected pair is a fixed value. Could you kindly explain how you transform this value into a distribution for training purposes? Or perhaps, I might have misunderstood the model's methodology. I would greatly appreciate your insights on this matter.

Thank you for your time!
Lin

Hi Lin,
In different ways. The simplest is just a "contact" distogram, i.e., a uniform distribution. Let's say we want to simulate a true restraint with an upper bound of 10 A. We choose any restraint that is < 10 A in the crystal structure. The distogram then has uniform probability over the bins up to 10 A. We sometimes replaced the uniform distribution with a subsampled "real" distance distribution over X restraints with < 10 A, or with a distance distribution based on simulated crosslinking data.
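
Such a "contact" distogram can be sketched as follows (the bin layout and the handling of the noise mass are assumptions for illustration; the traceback earlier in this thread suggests 128 bins):

```python
import numpy as np

# Hypothetical bin centers; the actual bin layout of the script may differ.
distogram_bins = np.linspace(2.0, 42.0, 128)

def uniform_contact_distogram(cutoff, fdr):
    """Uniform probability over the bins below the cutoff, carrying
    mass 1 - fdr, with the expected noise level (fdr) spread uniformly
    over the remaining bins (that spread is an assumption)."""
    d = np.ones(128)
    maximum_bin = np.argmax(distogram_bins > cutoff)  # first bin beyond cutoff
    d[:maximum_bin] /= np.sum(d[:maximum_bin])
    d[:maximum_bin] *= 1.0 - fdr
    d[maximum_bin:] = fdr / (128 - maximum_bin)
    return d
```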

Hi!
Thank you for your reply. I understand the operation for the "contact" distribution, but I noticed that you have trained two sets of parameters separately. The first set, "finetuning_model_5_ptm_CACA_10A.pt", should be the one you mentioned that was trained with a uniform distribution up to 10 A. I am curious how the "finetuning_model_5_ptm_distogram.pt" set was trained. It seems to involve more complex procedures.
Thank you for your time!
Lin

"finetuning_model_5_ptm_CACA_10A.pt" doesn't use a distogram as the internal representation of the crosslinking data. Here, the input data was simply a contact map.