OATML-Markslab/Tranception

Question about non-focus columns and DMS scores

hnisonoff opened this issue · 7 comments

Hi, I noticed that for some DMS studies there are EVmutation scores for mutations at positions that do not appear to be focus columns in the MSAs that you provided. Is EVmutation using a different MSA?

As an example, for the BLAT_ECOLX_Stiffler_2015 dataset, EVmutation has distinct scores for mutations at position 24:

  mutant  EVmutation
0   H24C   -7.206646
1   H24Y   -5.784716
2   H24W   -5.258699
3   H24V   -5.273463
4   H24T   -3.646145

However, in the MSA file the WT sequence is:

msiqhfrvalipffaafclpvfahpetlvkvkdaedqlgARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVdagQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGgPKELTAFLHNMGDHVTRLDRWEPelneaiPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSrIVVIYTTGSQatmdernrqiaeigaslikhw

Position 24 is a non-match column and is filtered out during MSA processing. In this case how are scores computed for these mutations?
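For anyone who wants to check this for other positions, here is a minimal sketch. It assumes the usual a2m convention that uppercase letters in the WT row mark match (focus) columns and lowercase letters mark non-match columns, and that numbering starts at 1. The helper name `is_focus_position` is made up for illustration and is not part of any of the code bases discussed here.

```python
def is_focus_position(wt_aligned: str, position: int, start: int = 1) -> bool:
    """Return True if `position` (numbered from `start`) is an uppercase,
    i.e. match/focus, column in the aligned WT row."""
    residue = wt_aligned[position - start]
    return residue.isupper()


# First 50 residues of the BLAT_ECOLX WT row quoted above.
wt_prefix = "msiqhfrvalipffaafclpvfahpetlvkvkdaedqlgARVGYIELDLN"

print(wt_prefix[23], is_focus_position(wt_prefix, 24))  # h False
print(wt_prefix[39], is_focus_position(wt_prefix, 40))  # A True
```

Position 24 is the lowercase `h`, so it is not a focus column, while position 40 (the first uppercase residue) is.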

Thanks!

Hi Hunter,

Great question! We use the same MSAs for all models in the benchmark, including EVmutation. As you noted, EVmutation and other alignment-based approaches (e.g., EVE, DeepSequence, Site Independent) do not typically train on (and therefore do not make predictions for) low-coverage positions. In the first released version of our performance files, we used the standard approach for these alignment-based models, so EVmutation scores were only available for sufficiently covered positions.

However, the ProteinGym benchmarks also include models that are able to score all positions (e.g., Tranception, RITA), including the low-coverage ones. As a result, our initial performance files had two sets of model comparisons: one comparing all models on the subset of well-covered positions, and another comparing the models able to score all positions on all mutants.

We subsequently investigated the effect of training alignment-based models on all positions, not just well-covered ones, as this would allow us to use these models to score all possible (substitution) mutations. We observed in particular that:

  1. The performance of these models (trained on all positions) on the subset of well-covered positions was on average similar to that of the same models trained on sufficiently covered positions only (for some proteins a bit lower, for some a bit higher, but similar in aggregate).
  2. The rank ordering of all models on sufficiently covered positions was nearly identical to the rank ordering of models on all mutants (using the newly trained versions of alignment-based models on all positions).

Consequently, to make things simpler, we are now only reporting one set of performance numbers for all models on all mutants, leveraging these alignment-based models trained on all positions (we made a note of that in the README).

To reproduce the scores we provide for EVmutation (or other alignment-based models), you would just need to pre-process all ProteinGym MSAs to ignore the low-coverage information (i.e., capitalize all sequences) and then train/score using the standard approach.
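The capitalization step could be sketched roughly as follows. This is only an illustration, not the script used to produce the released scores: the function name `capitalize_msa` is hypothetical, and the optional conversion of `.` gaps to `-` is an assumption about what downstream tools expect from an a2m file.

```python
def capitalize_msa(lines, dots_to_dashes=True):
    """Yield MSA lines with every sequence line upper-cased.

    Header lines (starting with '>') are passed through unchanged.
    Converting '.' to '-' is an assumption, not something stated in the thread.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            seq = line.upper()
            if dots_to_dashes:
                seq = seq.replace(".", "-")
            yield seq


# Toy a2m-style input: lowercase letters and '.' mark non-match columns.
example = [">WT/1-6", "mkAC-D", ">hom1", "..ACGD"]
print(list(capitalize_msa(example)))
# ['>WT/1-6', 'MKAC-D', '>hom1', '--ACGD']
```

To apply it to a file, iterate over an open input handle and write each yielded line to a new file. Line endings are preserved because `str.upper()` leaves them untouched.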

Thanks so much for the explanation!

@pascalnotin, sorry, I just noticed one other thing. It appears that the sequence weights that you provided are for MSAs with columns removed. Do you happen to have weights for the MSAs used when all positions were considered? Thanks!

Hi @hnisonoff, I just made these sequence weights (computed with all positions considered) available on our servers. You may download them as follows:

curl -o MSA_weights_substitutions_all_positions.zip https://marks.hms.harvard.edu/tranception/MSA_weights_substitutions_all_positions.zip
curl -o MSA_weights_indels_all_positions.zip https://marks.hms.harvard.edu/tranception/MSA_weights_indels_all_positions.zip

Please let me know if you run into any issues!

Thank you so much! This saves me a lot of compute.

I think it should be noted that the following sentence, although correct, can be misleading when computing EVE scores:

To reproduce the scores we provide for EVmutation (or other alignment-based models), you would just need to pre-process all ProteinGym MSAs to ignore the low-coverage information (i.e., capitalize all sequences) and then train/score using the standard approach.

This is because capitalizing the MSA alone will not work if you are having the EVE code base preprocess your MSA directly. Here is the link to the class for preprocessing the MSA.

I think Pascal has previously mentioned this, but one can add two lines of code to evol_indices.py and train_VAE.py in order to predict on non-focus columns.

Pass in the parameter threshold_focus_cols_frac_gaps=1 at this and this line of code. With that setting, the MSA preprocessing keeps every column, so training and prediction also cover the non-focus positions.
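To show why a threshold of 1 keeps every column, here is a toy sketch of gap-fraction column filtering. It is not the EVE implementation, which does more than this. The function `focus_columns`, its default of 0.3, and the rule "keep a column when its gap fraction is at or below the threshold" are assumptions made for illustration; only the parameter name comes from the comment above.

```python
def focus_columns(sequences, threshold_focus_cols_frac_gaps=0.3):
    """Return indices of columns whose gap fraction is <= the threshold.

    Illustrative only: assumes equal-length, already upper-cased sequences
    with '-' as the gap character.
    """
    n_seqs = len(sequences)
    n_cols = len(sequences[0])
    kept = []
    for col in range(n_cols):
        gap_frac = sum(seq[col] == "-" for seq in sequences) / n_seqs
        if gap_frac <= threshold_focus_cols_frac_gaps:
            kept.append(col)
    return kept


# Column gap fractions in this toy alignment: 0.0, 0.75, 0.5, 0.0
toy = ["ACDE", "A-DE", "A--E", "A--E"]
print(focus_columns(toy, 0.3))  # [0, 3]
print(focus_columns(toy, 1))    # [0, 1, 2, 3]
```

Since a gap fraction can never exceed 1, a threshold of 1 never excludes a column under this rule.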

Hope this helps anyone in the future trying to solve this bug!

Take care,
Bryce

That's correct - thank you @brycejoh16!