KULL-Centre/PRISM

[stability-pipeline] residue mismatch when pdb is cleaned

Closed this issue · 11 comments

If residues have missing atoms, PDB-clean removes them (e.g. residues 22-28 in the example). The check uses the pre-clean information, so a mismatch can arise. This was observed with a prism file as input for an MP call and likely also occurs for soluble proteins.

Bug in the main pipeline. The error is raised by the check:

File "/lustre/hpc/sbinlab/tiemann/repos/PRISM/PRISM/software/rosetta_ddG_pipeline/structure_input.py", line 224, in make_mutfiles
check = self.fasta_seq[residue_number_ros-1] in list(
IndexError: string index out of range

with previous mismatches:

INFO:Pipeline_logger:Convert prism file: /groups/sbinlab/tiemann/projects/PRISM/debug-files/issue24/output/input/prism_mave_input.txt
2020-09-28 11:28:26 - WARNING - MissmatchK, 57,W
WARNING:Pipeline_logger:MissmatchK, 57,W
2020-09-28 11:28:26 - WARNING - MissmatchG, 143,W
WARNING:Pipeline_logger:MissmatchG, 143,W

The problem occurs during the conversion of the prism file to a mutfile: it uses the wrongly aligned converter (built from the check on the pre-clean structure, not from the cleaned one).
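
To illustrate the failure mode, here is a minimal sketch of a guarded lookup that separates "residue number beyond the end of the cleaned sequence" from "wrong amino acid at that position". The function name and arguments are hypothetical and not taken from structure_input.py; the sketch only assumes a one-letter sequence string and 1-based residue numbers.

```python
def check_residue(seq, resnum, expected_aa):
    """Classify one (residue number, amino acid) pair against a sequence.

    seq         -- one-letter sequence of the structure actually used (assumed)
    resnum      -- 1-based residue number (assumed numbering convention)
    expected_aa -- one-letter wild-type residue from the variant file
    """
    # A number past the end of a shortened sequence would otherwise
    # raise IndexError on seq[resnum - 1].
    if not 1 <= resnum <= len(seq):
        return "out_of_range"
    if seq[resnum - 1] != expected_aa:
        return "mismatch"
    return "ok"
```

With a check like this, a shortened sequence shows up as explicit "out_of_range" or "mismatch" results rather than as a crash later in the mutfile generation.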

Example call:

python /groups/sbinlab/tiemann/repos/PRISM/PRISM/software/rosetta_ddG_pipeline/run_pipeline.py \
    --structure /groups/sbinlab/tiemann/projects/PRISM/debug-files/issue36/1bxw-clean.pdb \
    --mutate_mode prism \
    --prism /groups/sbinlab/tiemann/projects/PRISM/debug-files/issue36/prism_mave_103_OmpA_ecoli_unfolding_dG-stability_MP_tmp.txt \
    --outputpath /groups/sbinlab/tiemann/projects/PRISM/debug-files/issue36/output \
    --mode fullrun --chainid A --is_membrane True --mp_calc_span_mode DSSP --mp_align_ref 1bxw_A \
    --mp_prep_align_mode OPM --benchmark_mp_repack 8.0 --benchmark_mp_repeat 5 --benchmark_mp_relax_repeat 1 \
    --benchmark_mp_relax_strucs 1 --slurm_partition sbinlab --overwrite_path True

@andershbf this likely needs a discussion about the checks. We could avoid this by using refined PDB files (e.g. from PDB-REDO).

Actually it should only remove residues where backbone atoms are missing - could you check what exactly happens?
In any case, the safest is to keep a record of the coordinate sequence as Rosetta reads it, so e.g. after relax, and use that for resfile generation.
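
A rough sketch of recording the coordinate sequence from a PDB file, in case it helps the discussion. It is not Rosetta's own logic: it assumes standard fixed-column ATOM records, only the 20 canonical residues, and that a residue counts as present only if N, CA, C and O are all there. The function name is hypothetical.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
BACKBONE = {"N", "CA", "C", "O"}  # assumed definition of a complete backbone


def coordinate_sequence(pdb_lines, chain_id):
    """Return [(residue_number, one_letter_code), ...] for one chain.

    Residues lacking any of the BACKBONE atoms are skipped.
    """
    residues = {}  # (resSeq, iCode) -> (resName, atom names); keeps file order
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        if line[21] != chain_id:
            continue
        key = (int(line[22:26]), line[26])
        entry = residues.setdefault(key, (line[17:20], set()))
        entry[1].add(line[12:16].strip())

    sequence = []
    for (resnum, _icode), (resname, atoms) in residues.items():
        if BACKBONE <= atoms:  # all backbone atoms present
            sequence.append((resnum, THREE_TO_ONE.get(resname, "X")))
    return sequence
```

Running this on the structure before cleaning and again after relax would give two numbered sequences that can be compared directly, and the second one could feed resfile generation.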

I didn't check all of them, but it definitely removes residues where the backbone is there. @andershbf wrote the check script, so it makes more sense that he looks into it and removes or fixes the lookup dicts that are unused or wrong. Happy to assist and discuss what is best to do!

Thanks. It's sort of an independent issue, but if there are residues removed you think are fine (well-defined bb atoms) that seems wrong. Could you post this in Rosetta Slack, with a specific example?

I would first like to check with Anders, or by myself when I have more time, whether it's caused by something within our pipeline.

Sure - doesn't seem urgent, especially as long as we don't have external users. Just look into it whenever it becomes a problem.

It might go unnoticed, though - so everyone who uses the pipeline should check their relax/output structure!

That's generally true, certainly at this stage :) Please put such a note in the README if it's not there already.
It might be good to have the pipeline write the sequence of the coordinates after relax to an easily checked location.

Will do!
A sequence file is written, but not one that is aligned with the input sequence - so you can't easily see if something is missing.

  • Add note to README
  • Align input and output PDB sequences

Thanks! I think it's fine to just have the plain sequence, then one can quickly check if it's identical - if yes, all is fine (wrt mutfiles and such at least). If not, what to do will depend on what exactly the issue is, which we will probably need to figure out on a case-by-case basis for now.
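
Something along these lines would cover the quick check, as a sketch only. It uses Python's standard difflib; the function name and the shape of the returned tuples are made up for illustration and are not part of the pipeline.

```python
from difflib import SequenceMatcher


def sequence_differences(input_seq, output_seq):
    """Return [] if the sequences are identical, else a list of differences.

    Each entry is (tag, start, end, input_segment, output_segment), where
    start/end are 1-based inclusive positions in input_seq and tag is one of
    'delete', 'insert' or 'replace' as reported by difflib.
    """
    if input_seq == output_seq:
        return []
    matcher = SequenceMatcher(None, input_seq, output_seq, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, i1 + 1, i2, input_seq[i1:i2], output_seq[j1:j2])
            )
    return differences
```

An empty list means the mutfiles can be trusted with respect to numbering; anything else points at the stretch that needs a case-by-case look.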

Issue is solved - missing atoms are now added, the initial sequence alignment and the final output are checked (via the prism parser), and a note has been added to the README.