openmopac/mopac

Erroneous TER records with PDBOUT and other bugs with GEO_REF="SELF"

Closed this issue · 8 comments

There's a persisting bug of spurious TER records for protein models which affects the version 22.0.0 as well when using keywords

ADD-H SITE=(SALT) SITE=(COO,NH3,ARG(+),HIS(+),SO4,PO4) PDBOUT

Example can be found here: test.tar.gz.

The input file is S22A3_HUMAN.dat, the erroneous output file is S22A3_HUMAN.pdb. The TER record at line 8700 is not supposed to be there. The keyword PDBOUT clearly is the culprit as there's nothing wrong with the resulting arc file (also included).

Edit:
Another bug:
If I add following keywords to the to the resulting arc-file, I also get odd error

MOZYME EPS=78.4 GNORM=0.0 DUMP=10M OPT LET PDBOUT REORTH PL T=14.00D SCFCRT=1.D-20 MMOK GEO_REF="SELF"8.0

...

        Number of atoms in "... /S22A3_HUMAN.arc" = 8749

        Number of atoms in "... /S22A3_HUMAN.arc" =    2

        Number of atoms in both systems must be the same, unless keyword "0SCF" is present

Then again, I might be wrong of the usage myself, usually three empty lines has been enough, it seems (open)mopac requires four lines before the starting atom...

If I copy the file to arbritrarily chosen "reference.dat" and use GEO_REF="reference.dat", I still get the same error.
But if I take the keyword lines away (keeping three blank lines before the first atom line), it becomes apparent that the culprit is the GEO_REF parsing as I get the following error:

          Number of atoms in "... /S22A3_HUMAN.arc" = 8749

          Number of atoms in "reference.dat" = 8750

          Number of atoms in both systems must be the same, unless keyword "0SCF" is present

Go figure...

Maybe I am losing my marbles here, but usually three first lines has been reserved for keywords. Now openmopac requires four lines, the fourth being an empty one and the model should start from fifth line. When I tried it like this (keywords in the first two lines followed by two empty lines) and with all keywords removed from above mentioned "reference.dat" (three empty lines before the starting atom) it worked like a charm ..

According to the documentation "By default, MOPAC input files contain one line of keywords followed by two lines of comments."

If I were to decide, I would skip the archaic keyword -system completely and pass all keywords in setup file which should be present always (skipping lines commented with # )

The keywords &, +, and ++ on the keyword line can be used to either extend the keyword line to an extra line or cause the second line to also be read as a keyword line instead of a comment line. The keyword parser may be erroneously identifying the + in your SITE=... statement as such a keyword line extender, which would explain this type of behavior. The input parser can be a bit fragile sometimes, and I'll have to carefully examine what might have triggered the incorrect behavior in this instance.

Going forwards, MOPAC has an obligation to maintain its established system of input/output files to maintain backwards compatibility and established user expectations. However, this does not rule out alternate modes of input and output from being added in the future. There are already several options for MOPAC wrappers, such as ASE and QCEngine, and we might eventually release an "official" Python interface to MOPAC at some point. The hardest part about this is exposing output data in a more structured manner. The optional .aux output files are machine-readable, but they don't expose all of the useful outputs produced by MOPAC. Otherwise, outputs are not consistently consolidated in a data structure to expose through an interface yet.

The keywords &, +, and ++ on the keyword line can be used to either extend the keyword line to an extra line or cause the second line to also be read as a keyword line instead of a comment line. The keyword parser may be erroneously identifying the + in your SITE=... statement as such a keyword line extender, which would explain this type of behavior. The input parser can be a bit fragile sometimes, and I'll have to carefully examine what might have triggered the incorrect behavior in this instance.

That was not the case here. The second problem is unrelated to the first one in the original post as there is no ADD-H SITE=(SALT) SITE=(COO,NH3,ARG(+),HIS(+),SO4,PO4) PDBOUT present in the second one.

While not a proper solution to the first problem, the keyword NOTER will prevent TER lines from being written into PDB files by MOPAC.

Just to make sure as my experience with protein modeling and the PDB file format are both limited - TER is intended to denote a separation between distinct protein chains in the PDB file, and it should be placed between two residues. The example that you've provided has no breaks in its protein backbone, thus no TER lines should be printed. Furthermore, the erroneous TER is placed in the middle of a residue rather than between two residues.

Yes, you understood correctly :)

Another bug found with the same files and same MOZYME-run keywords as in the first post except for DUMP which I set as DUMP=60M
In the log I see DUMP=N - RESTART FILE WRITTEN EVERY 60.0 MINUTES REQUESTED, but no restart file has been created after running it 13 hours.

Yet another bug with the parsing.
MOZYME EPS=78.4 GNORM=0.0 DUMP=60M OPT LET NOTER PDBOUT REORTH PL T=100.00D ++
SCFCRT=1.D-20 MMOK GEO_REF="SELF"8.0
the 8.0 bias is not recognized.

[...]

*  GEO_REF    - REFERENCE GEOMETRY IS IN FILE "GABBR2_gaba_b-receptor.arc"
*               (A BIAS OF 8.0 KCAL/MOL/ANGSTROM^2 TOWARDS THE REFERENCE GEOMETRY WILL BE APPLIED)
 
[...]

 * UNRECOGNIZED KEY-WORDS: (8.0)                                              *
 * IF THESE ARE DEBUG KEYWORDS, ADD THE KEYWORD "DEBUG".                      *

Input file attached: input.tar.gz

Dirty solution is to use DEBUG to get into the Number of atoms in both systems must be the same -problem which can be solved as I previously mentioned. This is also caused by my erroneous use(?) of the ++ instead of just one + as you said. The ++ adds one extra line. If I use only one + and the first atom starts at the 5th line, then the calculation starts just fine. Ie:

 MOZYME EPS=78.4 GNORM=0.0 DUMP=10M OPT LET NOTER PDBOUT REORTH PL T=100.00D +
 SCFCRT=1.D-20 MMOK GEO_REF="SELF"8.0 DEBUG

If I use & instead of +, the first atom of the model must start at the fourth line for the computation to run correctly (with the DEBUG).

The originally reported TER bug was fixed in #49 by adjusting when the breaks in the peptide backbone are updated (e.g. after adding hydrogen atoms). The original example now contains a TER at the very end of the PDB file, which is a valid TER location.

The incorrect interaction between ++ and GEO_REF has been fixed in #50.

The issue with DUMP does not seem to be a bug, but instead a misunderstanding of what DUMP does. It records updates to the geometry during geometry relaxation and other geometry-altering processes. In your example, it is likely that no geometry updates are occurring to be stored by DUMP because SCFCRT=1.D-20 is preventing the SCF cycle from converging until the maximum number of SCF iterations has been performed (set by ITRY and 2000 by default). The geometry can only be updated between SCF cycles, and with your settings, the SCF cycle probably won't terminate in 13 hours (I would need timing information to understand this more definitely). I have confirmed that a restart file is being regularly printed for your example when ITRY and DUMP are set to very small values to force the premature termination of the SCF cycle.

Please close this Issue if these bugs are now resolved or update it with a clarification of lingering or related bugs. I will close this Issue myself at some point if there is no further activity on it.

This should be safe to close at this point.