sokrypton/ColabFold

Custom template raises error "hit.name did not start with PDBID_chain"

ajasja opened this issue · 16 comments

Expected Behavior

Colabfold outputs 5 models.

Current Behavior

Error in the Run Prediction cell

Steps to Reproduce (for bugs)

Seq: MLEEELKQLEEELQAIEEQLAQLQWKAQARKEKLAQLKEKLSGPGSPEDEIQQLEEEISQLEQKNSELKE KNQELKYGSGPGDIEQELERAKESIRRLEQEVNQERSRMQYLQTLLEKSGPGQLEDKVEELLSKNYHLEN EVERLKKLVGSGPGLEEELKQLEEELQAIEEQLAQLQWKAQARKEKLAQLKEKLSGPGSPEDEIQQLEEK NSQLKQEISQLEEKNQELKYGSGPGQLEDKVEELLSKNYHLENEVERLKKLVGSGPGSPEDKISQLKEKI QQLKQENQQLEEENSQLEYGSGPGSPEDENSQLEEKISQLKQKNSELKEEIQQLEYGSGPGSPEDKISEL KEENQQLEQKIQQLKEENSQLEYGSGPGDIEQELERAKESIRRLEQEVNQERSRMQYLQTLLEKSGPGSP EDKNSELKEEIQQLEEENQQLEEKISELKYGLEHHHHHHHH

job_name: tet12sn
Set template mode to custom
Upload the template: https://gist.github.com/ajasja/dce9c4f4c26a9f0ab8dabc06f1bcf89b

msa_mode is single_sequence.

ColabFold Output (for bugs)

WARNING: found GPU Tesla K80: limited to total length < 1000
2022-03-14 13:42:10,349 Running colabfold 1.2.0 (7e9952e15e38450d723b77fdc44b433fc6dee66b)
2022-03-14 13:42:10,352 Found 4 citations for tools or databases
2022-03-14 13:42:17,044 Query 1/1: TET12SN_ab478 (length 461)
COMPLETE: 100%|██████████| 150/150 [elapsed: 00:01 remaining: 00:00]
2022-03-14 13:42:20,073 Could not get MSA/templates for TET12SN_ab478: hit.name did not start with PDBID_chain: tet12sn_.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/colabfold/batch.py", line 1124, in run
    host_url,
  File "/usr/local/lib/python3.7/dist-packages/colabfold/batch.py", line 633, in get_msa_and_templates
    query_seqs_unique[index],
  File "/usr/local/lib/python3.7/dist-packages/colabfold/batch.py", line 123, in mk_template
    query_sequence=query_sequence, hits=hhsearch_hits
  File "/usr/local/lib/python3.7/dist-packages/alphafold/data/templates.py", line 902, in get_templates
    kalign_binary_path=self._kalign_binary_path)
  File "/usr/local/lib/python3.7/dist-packages/alphafold/data/templates.py", line 697, in _process_single_hit
    hit_pdb_code, hit_chain_id = _get_pdb_id_and_chain(hit)
  File "/usr/local/lib/python3.7/dist-packages/alphafold/data/templates.py", line 103, in _get_pdb_id_and_chain
    raise ValueError(f'hit.name did not start with PDBID_chain: {hit.name}')
ValueError: hit.name did not start with PDBID_chain: tet12sn_.

Context

Trying to run a difficult structure by adding a new template. (Trying to set this PDB as initial guess)

Your Environment

Running colabfold 1.2.0 (7e9952e)

Also, the original template pdb did not have a chain ID, but even if I add one (the file attached below), I get the same error.
tet12sna.zip

Custom template files must be named with a four-character, PDB-style code, e.g. 9xzx.cif, and must contain an _entity_poly_seq entry. For our custom template feature we reuse the AlphaFold2 mmCIF parsing logic, which uses _entity_poly_seq for its chain information.
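
To illustrate why a name like tet12sn is rejected, here is a minimal sketch of the name check that raises the error in the traceback above. The regular expression is my reading of AlphaFold's `_get_pdb_id_and_chain` and should be treated as an assumption; `HIT_NAME_RE` and `split_hit_name` are hypothetical names used only for this example.

```python
import re

# Assumed pattern: a 4-character PDB-style ID, an underscore, then a chain ID.
# This is a sketch of the check, not a copy of AlphaFold's code.
HIT_NAME_RE = re.compile(r"[a-zA-Z\d]{4}_[a-zA-Z0-9.]+")


def split_hit_name(hit_name):
    """Return (pdb_id, chain_id), or raise ValueError like the traceback above."""
    match = HIT_NAME_RE.match(hit_name)
    if match is None:
        raise ValueError(f"hit.name did not start with PDBID_chain: {hit_name}")
    # The chain ID part cannot contain "_", so this split always yields two items.
    pdb_id, chain_id = match.group(0).split("_")
    return pdb_id.lower(), chain_id
```

Under this assumed pattern, `split_hit_name("9xzx_A")` succeeds, while `split_hit_name("tet12sn_.")` raises because the fifth character is not an underscore.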

In addition to these two rules, the file must contain a _pdbx_audit_revision_history.revision_date entry to work as a custom template. See also:
https://github.com/deepmind/alphafold/blob/8cdec53989ce714e9542d304f21cd06a7aa4a30b/alphafold/data/mmcif_parsing.py#L289-L292
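
A quick pre-flight check for these rules could look like the sketch below. It is a plain substring search, not a real mmCIF parse, so it only tells you whether the tags appear somewhere in the file. `REQUIRED_TAGS`, `missing_template_tags` and `check_template_file` are hypothetical names, not part of ColabFold or AlphaFold.

```python
from pathlib import Path

# Tags named in this thread as required for a custom template.
# The trailing "." on the first tag avoids matching unrelated category names.
REQUIRED_TAGS = (
    "_entity_poly_seq.",
    "_pdbx_audit_revision_history.revision_date",
)


def missing_template_tags(cif_text):
    """Return the required tags that do not appear anywhere in cif_text."""
    return [tag for tag in REQUIRED_TAGS if tag not in cif_text]


def check_template_file(path):
    """Return a list of problems; an empty list means these checks passed."""
    path = Path(path)
    problems = []
    stem = path.stem
    # Name rule from this thread: four characters plus the .cif extension.
    if path.suffix != ".cif" or len(stem) != 4 or not stem.isalnum():
        problems.append(f"{path.name}: name should look like 9xzx.cif")
    for tag in missing_template_tags(path.read_text()):
        problems.append(f"{path.name}: missing {tag}")
    return problems
```

Passing these checks does not guarantee that AlphaFold's parser accepts the file; it only catches the problems discussed here before you upload.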

For now, your custom structure file can be used as a template by following the steps below:

  1. Save your structure in PDB format using PyMOL, UCSF Chimera, etc.
  2. Convert the PDB file to mmCIF format. This web service is useful: https://mmcif.pdbj.org/converter/index.php?l=en (the _entity_poly_seq entry is generated automatically from your file).
  3. Open the file with a text editor and add a _pdbx_audit_revision_history.revision_date entry at the end of the file (after the ATOM records). For example:
ATOM 3754 O OXT . HIS A 1 461 ? -135.506 -147.566 115.562 1.00 22.56  -1 461 HIS A OXT 461 HIS A OXT 1
#
loop_
_pdbx_audit_revision_history.ordinal
_pdbx_audit_revision_history.data_content_type
_pdbx_audit_revision_history.major_revision
_pdbx_audit_revision_history.minor_revision
_pdbx_audit_revision_history.revision_date
1 'Structure model' 1 0 1971-01-01
#
  4. Rename the mmCIF file to a 4-character code, e.g. 9xzx.cif.
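
If you prefer not to edit the file by hand, the last two steps can be sketched in Python as follows. This assumes you already have an mmCIF file from the conversion step; `with_revision_date` and `prepare_template` are hypothetical helper names, and the default code 9xzx is just the example name used in this thread.

```python
from pathlib import Path

REVISION_TAG = "_pdbx_audit_revision_history.revision_date"

# Same placeholder loop as in the example above, including its date.
REVISION_BLOCK = """#
loop_
_pdbx_audit_revision_history.ordinal
_pdbx_audit_revision_history.data_content_type
_pdbx_audit_revision_history.major_revision
_pdbx_audit_revision_history.minor_revision
_pdbx_audit_revision_history.revision_date
1 'Structure model' 1 0 1971-01-01
#
"""


def with_revision_date(cif_text):
    """Return cif_text with the revision-history loop appended if it is absent."""
    if REVISION_TAG in cif_text:
        return cif_text
    if not cif_text.endswith("\n"):
        cif_text += "\n"
    return cif_text + REVISION_BLOCK


def prepare_template(cif_in, template_dir, code="9xzx"):
    """Write a patched copy of cif_in to template_dir/<code>.cif and return its path."""
    if len(code) != 4 or not code.isalnum():
        raise ValueError("code must be 4 alphanumeric characters")
    out_dir = Path(template_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{code.lower()}.cif"
    out_path.write_text(with_revision_date(Path(cif_in).read_text()))
    return out_path
```

For example, `prepare_template("tet12sn.cif", "tet12s_template")` would write tet12s_template/9xzx.cif and leave the input file untouched.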

With these changes, I could use your file as a custom template (in localcolabfold).

$ colabfold_batch \
  --num-recycle 3 \
  --templates \
  --amber \
  --msa-mode single_sequence \
  --custom-template-path tet12s_template \
  --num-models 2 \
  --model-order 1,2 \
  --overwrite-existing-results \
  tet12s.fasta tet12s_output

WARNING: You are welcome to use the default MSA server, however keep in mind that it's a limited shared resource only capable of processing a few thousand MSAs per day. Please submit jobs only from a single IP address. We reserve the right to limit access to the server case-by-case when usage exceeds fair use.

If you require more MSAs, please host your own API and pass it to `--host-url`
2022-03-17 09:24:39,076 Running colabfold 1.2.0 (1f2d963e7705b0d8c847980e4b1c44243127febb)
2022-03-17 09:24:39,078 Found 5 citations for tools or databases
2022-03-17 09:24:43,068 Query 1/1: tet12sn (length 461)
2022-03-17 09:24:44,089 Sequence 0 found templates: [b'9xzx_A' b'9xzx_A' b'9xzx_A' b'9xzx_A']
2022-03-17 09:24:44,090 Running model_1
2022-03-17 09:27:47,419 model_1 took 181.6s (3 recycles) with pLDDT 78
2022-03-17 09:28:16,970 Running model_2
2022-03-17 09:29:51,333 model_2 took 93.5s (3 recycles) with pLDDT 77.6
2022-03-17 09:30:16,041 reranking models by plddt
2022-03-17 09:30:16,707 Done

Also, this custom template file could be used in AlphaFold2.ipynb.

Thank you @YoshitakaMo & @martin-steinegger, it works perfectly!

@martin-steinegger Do you want me to close this issue?
The info would be super useful in the readme or directly in the notebook:

"custom" - upload and search own templates (mmCIF format) -->
"custom" - upload and search own templates (mmCIF format, name must be PDB-id like, file must contain _entity_poly_seq and _pdbx_audit_revision_history.revision_date)

I added the help text to the AlphaFold2 notebook and also linked this issue for users to find more info. I will close this issue for now. :)

The web service https://mmcif.pdbj.org/converter/index.php?l=en is sometimes down. For macOS/Linux users who want to run the conversion in bulk from the command line on their own machine, I made a Homebrew formula for MAXIT.

# install MAXIT 
$ brew install brewsci/bio/maxit
...
==> Installing maxit from brewsci/bio
==> Pouring maxit--11.100_1.catalina.bottle.tar.gz
🍺  /usr/local/Cellar/maxit/11.100_1: 31 files, 887.7MB

# show the usage
$ maxit
Usage: maxit -input inputfile -output outputfile -o num [ -log logfile ]
  [-o  1: Translate PDB format file to CIF format file]
  [-o  2: Translate CIF format file to PDB format file]
  [-o  8: Translate CIF format file to mmCIF format file]

# sample
$ maxit -input foo.pdb -output bar.cif -o 1

This lets you convert your custom pdb/cif files locally instead of using the web service.
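
For converting many files at once, a small Python wrapper around the sample command might look like this. It is only a sketch: it assumes `maxit` is on your PATH and already set up to run, and `maxit_command` and `convert_directory` are hypothetical names.

```python
import subprocess
from pathlib import Path


def maxit_command(pdb_path, cif_path):
    """Build the PDB -> CIF command from the sample above (-o 1)."""
    return ["maxit", "-input", str(pdb_path), "-output", str(cif_path), "-o", "1"]


def convert_directory(pdb_dir):
    """Run maxit on every *.pdb file in pdb_dir, writing a .cif next to each one."""
    outputs = []
    for pdb_path in sorted(Path(pdb_dir).glob("*.pdb")):
        cif_path = pdb_path.with_suffix(".cif")
        # check=True raises CalledProcessError if maxit exits with a non-zero status.
        subprocess.run(maxit_command(pdb_path, cif_path), check=True)
        outputs.append(cif_path)
    return outputs
```

The converted files still need the revision_date entry and a 4-character file name before they can be used as templates.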

I wrote a similar script for ChimeraX, but its output still lacks the _entity_poly_seq entry ;/

import subprocess

def convert_with_chimera(pdb_in, cif_out=None, chimera_bin="C:/bin/chimeraX-dev/bin/ChimeraX-console.exe"):
    """Convert a PDB file to mmCIF with ChimeraX, then append a revision-history block."""
    if cif_out is None:
        cif_out = pdb_in.replace('.pdb', '.cif')
    # Run ChimeraX without a GUI: open the PDB file, save it as mmCIF, then exit.
    cmd = [chimera_bin, "--exit", "--nogui", "--notools", "--nostatus",
           "--cmd", f"open {pdb_in}; save {cif_out} bestGuess true"]
    subprocess.run(cmd, check=True)

    # Placeholder revision history providing the required revision_date entry.
    time_stamp = """
#
loop_
_pdbx_audit_revision_history.ordinal
_pdbx_audit_revision_history.data_content_type
_pdbx_audit_revision_history.major_revision
_pdbx_audit_revision_history.minor_revision
_pdbx_audit_revision_history.revision_date
1 'Structure model' 1 0 1971-01-01
#
"""
    # Append the block to the end of the converted file.
    with open(cif_out, 'a') as fh:
        fh.write(time_stamp)

#convert_with_chimera('aphx.pdb')

@martin-steinegger How are multiple templates taken into account? I'm trying to give just one template for each edge (6 templates), but I don't see them in the MSA file or the sequence coverage plot.

Could I specify which template is used with a custom MSA?

Also is it possible to use more than four templates?

Here is how we implemented the custom template feature:
(1) We break the custom mmCIF files into single chains and build an HHsearch database.
(2) We search each chain against the database and assign templates following the AF2 procedure.

Currently we do not have a way to assign templates based on a custom MSA. What exactly are you trying to do?

@martin-steinegger I have a complicated structure built from 6 coiled-coils (CC) (https://www.nature.com/articles/nbt.3994, https://github.com/NIC-SBI/CC_protein_origami). AF2 cannot predict those without an initial structure, but I'm trying to see if information about CC pairing alone (i.e. just the structures of the individual CCs) could be enough to fold the structure. The problem is that the pairs are really far apart in the primary sequence.
I see now that the templates are per chain, so this might not work...

PS: do custom templates work with multimer as well?

Per-chain template features are not so easy to develop. Custom templates work with multimer as well, but the template information is only assigned to each monomer separately. There is no feature in multimer that directly extracts the chain orientation of complexes.

Hi AlphaFold2 prediction group,

I also ran into a problem with a custom template, but a different one from those reported above. I produced the custom template file following your suggestions and named it new1.cif. When I tried to upload it, I got this error:


MessageError Traceback (most recent call last)
in
39 custom_template_path = f"{jobname}_template"
40 os.mkdir(custom_template_path)
---> 41 uploaded = files.upload()
42 use_templates = True
43 for fn in uploaded.keys():

3 frames
/usr/local/lib/python3.7/dist-packages/google/colab/_message.py in read_reply_from_input(message_id, timeout_sec)
100 reply.get('colab_msg_id') == message_id):
101 if 'error' in reply:
--> 102 raise MessageError(reply['error'])
103 return reply.get('data', None)
104

MessageError: RangeError: Maximum call stack size exceeded.

May I ask how to solve this issue? I guess it is something related to the prediction settings?

Many thanks in advance!
Best
Jun

I am withdrawing the report below; as suggested, I have posted it as a new issue (#469).

I encountered a new issue with custom templates.

I use the default options to predict an antibody-antigen complex (3 chains in total) with the following input seq:
ELTQSPATLSLSPGERATLSCRASQSVGRNLGWYQQKPGQAPRLLIYDASNRATGIPARFSGSGSGTDFTLTISSLEPEDFAVYYCQARLLLPQTFGQGTKVEIKRTV:EVQLLESGPGLLKPSETLSLTCTVSGGSMINYYWSWIRQPPGERPQWLGHIIYGGTTKYNPSLESRITISRDISKNQFSLRLNSVTAADTAIYYCARVAIGVSGFLNYYYYMDVWGSGTAVTVSS:WNWFDITNK

The custom PDB I use is attached as 3fn0.pdb:
[3fn0.zip](https://github.com/sokrypton/ColabFold/files/11891220/3fn0.zip)


The prediction output shows the following error message (these are not warnings but errors, as explained next):

2023-06-28 06:37:51,309 Sequence 0 found no templates
2023-06-28 06:37:52,599 Sequence 1 found no templates
2023-06-28 06:37:53,010 Sequence 2 found no templates

I did some tracing with an offline version of ColabFold. Indeed, AlphaFold does not like the converted 3fn0.cif (this cif can also be found in the attachment).

To reproduce AlphaFold's error, please see
https://github.com/deepmind/alphafold/issues/788
You can find example code that reads 3fn0.cif and triggers an error in AlphaFold, which leads AF to not use any template.

I initially reported this as an AlphaFold issue, but later I found out that the .cif templates extracted by the non-custom-template method work, so AF's code is not at fault here. My current belief is that there is something in the ColabFold-generated 4fn0.cif that breaks DeepMind's code.

I would really appreciate it if you could take a look. With this bug, the template I provided cannot be used by the prediction.

@data2code It's probably best if you open a new issue.