qcxms/QCxMS

pqcxms idles when SCC not converged

Closed this issue · 2 comments

Hi,
this is an issue with the CID and EI calculations on a NUMA node, using the provided tetrahydrofuran (CID) and 2-chloroethanol (EI) examples (QCxMS V5.0.2). For some trajectories an error occurs: "SCC not converged". In that case the pqcxms bash script keeps running, but no further qcxms child processes are spawned and no CPU activity is observed. The directory files below report the process as "running" although it is actually idle and stays idle indefinitely.

JOB  TMP.249  STARTED   166 JOBs done.
JOB  TMP.25  STARTED   167 JOBs done.
JOB  TMP.250  STARTED   168 JOBs done.
JOB  TMP.251  STARTED   169 JOBs done.
JOB  TMP.252  STARTED   170 JOBs done.
JOB  TMP.253  STARTED   171 JOBs done.
JOB  TMP.254  STARTED   172 JOBs done.
JOB  TMP.255  STARTED   173 JOBs done.
JOB  TMP.256  STARTED   174 JOBs done.
JOB  TMP.257  STARTED   175 JOBs done.
JOB  TMP.258  STARTED   176 JOBs done.
JOB  TMP.259  STARTED   177 JOBs done.
JOB  TMP.26  STARTED   178 JOBs done.
JOB  TMP.260  STARTED   179 JOBs done.

qcxms.out file:

             E N T E R I N G   M D   M O D U L E

Eimp (eV) =    9.7     tauIC (fs) =   944.      nstep =   20000

avcycle =   50    more =  250

step   time [fs]    Epot       Ekin       Etot    error  #F   eTemp   frag. T
      0      0.    -14.89592   0.0142   -14.88172  0.0000 1    5000.     332.
    100     50.    -14.88876   0.0333   -14.85542  0.0000 1    5000.     624.
    200    100.    -14.88375   0.0752   -14.80858  0.0000 1    5000.     858.
    300    150.    -14.83579   0.0919   -14.74386  0.0000 1    5000.    1141.
    400    200.    -14.82937   0.1277   -14.70164  0.0000 1    5000.    1489.
    500    250.    -14.79716   0.1267   -14.67044  0.0000 1    5000.    1795.
    600    300.    -14.76365   0.1566   -14.60707  0.0000 1    5000.    2054.
    700    350.    -14.74900   0.1830   -14.56597  0.0000 1    5000.    2347.
    800    400.    -14.64915   0.1917   -14.45746  0.0000 1    5000.    2613.
    900    450.    -14.66616   0.2477   -14.41845  0.0000 1    5000.    2877.
   1000    500.    -14.59358   0.1856   -14.40794  0.0000 2    5000.    5186.  3385.
   1100    550.    -14.65769   0.2497   -14.40799  0.0000 2    5000.    7880.  3495.
   1200    600.    -14.65252   0.2511   -14.40144  0.0000 2    5000.    7953.  3250.
   1300    650.    -14.62494   0.2212   -14.40376  0.0000 2    5000.    6041.  4144.
   1400    700.    -14.68314   0.2693   -14.41382  0.0000 2    5000.    7975.  4497.
   1500    750.    -14.70757   0.3098   -14.39781  0.0000 2    5000.   10039.  3726.
   1600    800.    -14.65471   0.2488   -14.40589  0.0000 2    5000.    7342.  4037.
   1700    850.    -14.66420   0.2622   -14.40199  0.0000 2    5000.    6625.  5377.
   1800    900.    -14.77275   0.3612   -14.41153  0.0000 2    5000.    9159.  7832.
   1900    950.    -14.70104   0.2934   -14.40761  0.0080 2    5000.    6569.  7381.
   1936    968.    -14.65855   0.2599   -14.39868  0.0014 2    5000.    7122.  4718.

        E X I T   M D  because nothing happens here anymore

          Results
 average Ekin     0.194914
 average Epot    -14.721677
 average Etot    -14.526763
 average T           4559.2
 average last T      6358.7
 fragment assigment list:112221121
 computing average fragment structures ...

   inter fragment distances (Angst.)

           1         2

    1   0.00000
    2  23.76383   0.00000

 computing IPs with XTB2 at (K)   6359.
 fragment  1 E(N)=     -5.6141  E(I)=     -5.1311     IP/EA(eV)=   13.14
SCC not converged

and qcxms.out

             =======================================
             |                                     |
             |        S C C  calculation           |
             |                                     |
             =======================================
 Ncao      :           16
 Nao       :           15
 Nel       :           14
 T(el)     :   300.0
 intcut    :    25.0
 scfconv   :  0.100E-05
   qconv   :  0.100E-03
 intneglect:  0.100E-07
 broydamp  :      0.250
 Nshell    :            7
% non-zero in H: 65.00
 iter      E             dE          RMSdq      gap      omega  full diag
   1     -8.5660348 -0.856603E+01  0.115E+01    0.02       0.0  T
   2     -8.5767755 -0.107406E-01  0.111E+01    0.02       1.0  T
   3     -8.3706258  0.206150E+00  0.148E+01    0.75       1.0  T
   4     -8.5523940 -0.181768E+00  0.122E+01    0.40       1.0  T
   5     -8.4386773  0.113717E+00  0.141E+01    0.64       1.0  T
   6     -8.5256499 -0.869726E-01  0.125E+01    0.51       1.0  T
   7     -8.5985857 -0.729358E-01  0.977E+00    0.48       1.0  T
..snip
 490     -8.7010104 -0.166266E-01  0.784E+00    0.15       1.0  T
 491     -8.8700060 -0.168996E+00  0.465E+00    0.08       1.0  T
 492     -8.7879357  0.820703E-01  0.585E+00    0.02       1.0  T
 493     -8.7165862  0.713495E-01  0.757E+00    0.15       1.0  T
 494     -8.8699285 -0.153342E+00  0.461E+00    0.08       1.0  T
 495     -8.8637161  0.621237E-02  0.442E+00    0.03       1.0  T
 496     -8.7113696  0.152346E+00  0.767E+00    0.15       1.0  T
 497     -8.7663967 -0.550271E-01  0.669E+00    0.11       1.0  T
 498     -8.8286966 -0.622998E-01  0.524E+00    0.02       1.0  T
 499     -8.6974307  0.131266E+00  0.791E+00    0.16       1.0  T
 500     -8.8051980 -0.107767E+00  0.602E+00    0.10       1.0  T
 501     -8.7490615  0.561365E-01  0.650E+00    0.02       1.0  T
 SCC converged in          501  cycles

          eigenvalues
 #    :         1        2        3        4        5        6        7        8
 occ. :      2.000    2.000    2.000    2.000    1.999    1.990    1.171    0.840
 eps  :     -35.908  -26.250  -23.201  -21.858  -20.545  -20.472  -20.344  -20.327
 #    :         9       10       11       12       13       14       15
 occ. :      0.000    0.000    0.000    0.000    0.000    0.000    0.000
 eps  :     -11.862   -7.778   -7.773   -7.656   -7.656   -6.309   16.738
 SCC not converged

Running the same input coordinates again does not produce any error and the MD runs fine, so the error cannot be easily reproduced. When the "SCC not converged" error occurs, the last line of both the xtb.tmp file and the qcxms.out file contains the " SCC not converged" statement. So either the program should write the FINISHED file, or the bash script needs to be adapted to detect the SCC error and trigger the next run. In other words, the convergence error itself does not need to be resolved; the program or the pqcxms.sh bash script just needs to handle it.

This is not a problem when a scheduler with a fixed termination time is used, but it is an issue on single PCs, workstations, or NUMA nodes, where the pqcxms.sh script cannot proceed and simply hangs.
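
A minimal sketch of the script-side workaround, not the actual pqcxms code. It assumes that each trajectory runs in its own TMP.<n> directory with its log in qcxms.out, and that the launcher treats a file named FINISHED in that directory as "job done". The function name `mark_scc_failures` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of pqcxms: flag trajectories that died with
# "SCC not converged" so that a launcher waiting on them can move on.
# Assumptions: one TMP.<n>/ directory per trajectory, log file qcxms.out,
# and a FINISHED marker file that the launcher interprets as "job done".

mark_scc_failures() {
    local root="${1:-.}" dir log
    for dir in "$root"/TMP.*/; do
        [ -d "$dir" ] || continue            # glob matched nothing
        log="${dir}qcxms.out"
        [ -f "$log" ] || continue            # trajectory not started yet
        [ -e "${dir}FINISHED" ] && continue  # already marked
        # React only if the error is the last non-empty line of the log.
        if grep -v '^[[:space:]]*$' "$log" | tail -n 1 | grep -q 'SCC not converged'; then
            echo "SCC not converged" > "${dir}FINISHED"
            echo "marked ${dir%/} as finished (SCC not converged)"
        fi
    done
}
```

Calling `mark_scc_failures` on the run directory at regular intervals (for example from the launcher's wait loop) would be enough to unblock the queue. One caveat: a trajectory that is still running could show the message as its last line for a moment, so a real fix should also check that the corresponding qcxms process has exited.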

Cheers
Tobias

I agree, this looks like a pqcxms issue. The script itself did not change between versions (or between QCEIMS and QCxMS), so this issue probably affects all previous versions of the program as well.
I will look into this and get back to you. Thanks for bringing this up @tobigithub!

I think this is solved, because I have not observed the issue anymore.
T.