pqcxms idles when SCC not converged
Closed this issue · 2 comments
Hi,
this issue affects the CID and EI calculations on a NUMA node for the provided tetrahydrofuran (CID) and 2-chloroethanol (EI) examples (QCxMS V5.0.2). For some trajectories an error occurs: "SCC not converged". In such a case the pqcxms bash script keeps running, but no further qcxms child processes are spawned and no CPU activity is observed. The directory files below state that the process is "running" when it actually is not; it just idles indefinitely.
JOB TMP.249 STARTED 166 JOBs done.
JOB TMP.25 STARTED 167 JOBs done.
JOB TMP.250 STARTED 168 JOBs done.
JOB TMP.251 STARTED 169 JOBs done.
JOB TMP.252 STARTED 170 JOBs done.
JOB TMP.253 STARTED 171 JOBs done.
JOB TMP.254 STARTED 172 JOBs done.
JOB TMP.255 STARTED 173 JOBs done.
JOB TMP.256 STARTED 174 JOBs done.
JOB TMP.257 STARTED 175 JOBs done.
JOB TMP.258 STARTED 176 JOBs done.
JOB TMP.259 STARTED 177 JOBs done.
JOB TMP.26 STARTED 178 JOBs done.
JOB TMP.260 STARTED 179 JOBs done.
qcxms.out file:
E N T E R I N G M D M O D U L E
Eimp (eV) = 9.7 tauIC (fs) = 944. nstep = 20000
avcycle = 50 more = 250
step time [fs] Epot Ekin Etot error #F eTemp frag. T
0 0. -14.89592 0.0142 -14.88172 0.0000 1 5000. 332.
100 50. -14.88876 0.0333 -14.85542 0.0000 1 5000. 624.
200 100. -14.88375 0.0752 -14.80858 0.0000 1 5000. 858.
300 150. -14.83579 0.0919 -14.74386 0.0000 1 5000. 1141.
400 200. -14.82937 0.1277 -14.70164 0.0000 1 5000. 1489.
500 250. -14.79716 0.1267 -14.67044 0.0000 1 5000. 1795.
600 300. -14.76365 0.1566 -14.60707 0.0000 1 5000. 2054.
700 350. -14.74900 0.1830 -14.56597 0.0000 1 5000. 2347.
800 400. -14.64915 0.1917 -14.45746 0.0000 1 5000. 2613.
900 450. -14.66616 0.2477 -14.41845 0.0000 1 5000. 2877.
1000 500. -14.59358 0.1856 -14.40794 0.0000 2 5000. 5186. 3385.
1100 550. -14.65769 0.2497 -14.40799 0.0000 2 5000. 7880. 3495.
1200 600. -14.65252 0.2511 -14.40144 0.0000 2 5000. 7953. 3250.
1300 650. -14.62494 0.2212 -14.40376 0.0000 2 5000. 6041. 4144.
1400 700. -14.68314 0.2693 -14.41382 0.0000 2 5000. 7975. 4497.
1500 750. -14.70757 0.3098 -14.39781 0.0000 2 5000. 10039. 3726.
1600 800. -14.65471 0.2488 -14.40589 0.0000 2 5000. 7342. 4037.
1700 850. -14.66420 0.2622 -14.40199 0.0000 2 5000. 6625. 5377.
1800 900. -14.77275 0.3612 -14.41153 0.0000 2 5000. 9159. 7832.
1900 950. -14.70104 0.2934 -14.40761 0.0080 2 5000. 6569. 7381.
1936 968. -14.65855 0.2599 -14.39868 0.0014 2 5000. 7122. 4718.
E X I T M D because nothing happens here anymore
Results
average Ekin 0.194914
average Epot -14.721677
average Etot -14.526763
average T 4559.2
average last T 6358.7
fragment assigment list:112221121
computing average fragment structures ...
inter fragment distances (Angst.)
1 2
1 0.00000
2 23.76383 0.00000
computing IPs with XTB2 at (K) 6359.
fragment 1 E(N)= -5.6141 E(I)= -5.1311 IP/EA(eV)= 13.14
SCC not converged
and qcxms.out
=======================================
| |
| S C C calculation |
| |
=======================================
Ncao : 16
Nao : 15
Nel : 14
T(el) : 300.0
intcut : 25.0
scfconv : 0.100E-05
qconv : 0.100E-03
intneglect: 0.100E-07
broydamp : 0.250
Nshell : 7
% non-zero in H: 65.00
iter E dE RMSdq gap omega full diag
1 -8.5660348 -0.856603E+01 0.115E+01 0.02 0.0 T
2 -8.5767755 -0.107406E-01 0.111E+01 0.02 1.0 T
3 -8.3706258 0.206150E+00 0.148E+01 0.75 1.0 T
4 -8.5523940 -0.181768E+00 0.122E+01 0.40 1.0 T
5 -8.4386773 0.113717E+00 0.141E+01 0.64 1.0 T
6 -8.5256499 -0.869726E-01 0.125E+01 0.51 1.0 T
7 -8.5985857 -0.729358E-01 0.977E+00 0.48 1.0 T
..snip
490 -8.7010104 -0.166266E-01 0.784E+00 0.15 1.0 T
491 -8.8700060 -0.168996E+00 0.465E+00 0.08 1.0 T
492 -8.7879357 0.820703E-01 0.585E+00 0.02 1.0 T
493 -8.7165862 0.713495E-01 0.757E+00 0.15 1.0 T
494 -8.8699285 -0.153342E+00 0.461E+00 0.08 1.0 T
495 -8.8637161 0.621237E-02 0.442E+00 0.03 1.0 T
496 -8.7113696 0.152346E+00 0.767E+00 0.15 1.0 T
497 -8.7663967 -0.550271E-01 0.669E+00 0.11 1.0 T
498 -8.8286966 -0.622998E-01 0.524E+00 0.02 1.0 T
499 -8.6974307 0.131266E+00 0.791E+00 0.16 1.0 T
500 -8.8051980 -0.107767E+00 0.602E+00 0.10 1.0 T
501 -8.7490615 0.561365E-01 0.650E+00 0.02 1.0 T
SCC converged in 501 cycles
eigenvalues
# : 1 2 3 4 5 6 7 8
occ. : 2.000 2.000 2.000 2.000 1.999 1.990 1.171 0.840
eps : -35.908 -26.250 -23.201 -21.858 -20.545 -20.472 -20.344 -20.327
# : 9 10 11 12 13 14 15
occ. : 0.000 0.000 0.000 0.000 0.000 0.000 0.000
eps : -11.862 -7.778 -7.773 -7.656 -7.656 -6.309 16.738
SCC not converged
Running the input coordinates again does not result in any error and the MD is just fine, so the error itself cannot be easily reproduced. When the "SCC not converged" error occurs, the last line of both the xtb.tmp file and the qcxms.out file contains the " SCC not converged" statement. That means either the program should write the FINISHED file, or the bash script itself needs to be adapted to read the SCC error and trigger the next run. The specific convergence error does not need to be resolved; the program or the pqcxms.sh bash script just needs to be adapted to handle this error.
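To illustrate the kind of check I mean, here is a minimal sketch (not the actual pqcxms logic). It assumes the trajectory directories are named TMP.N as in the listing above, that each contains a qcxms.out, and that a file called FINISHED marks a completed run. The function names and the idea of writing the marker from outside are my own assumptions for illustration.

```python
from pathlib import Path

# Marker text taken from the last line of the failing qcxms.out above.
ERROR_MARKER = "SCC not converged"


def last_nonempty_line(path):
    """Return the last non-blank line of a text file, or '' if there is none."""
    lines = Path(path).read_text(errors="replace").splitlines()
    for line in reversed(lines):
        if line.strip():
            return line.strip()
    return ""


def find_stalled_runs(root, out_name="qcxms.out", done_name="FINISHED"):
    """List TMP.* directories whose output ends with the SCC error
    and that have no completion marker yet.

    The file names are assumptions based on this report, not on the
    pqcxms source.
    """
    stalled = []
    for run_dir in sorted(Path(root).glob("TMP.*")):
        out_file = run_dir / out_name
        if not run_dir.is_dir() or not out_file.is_file():
            continue  # not a trajectory directory, or nothing written yet
        if (run_dir / done_name).exists():
            continue  # already marked as done
        if ERROR_MARKER in last_nonempty_line(out_file):
            stalled.append(run_dir)
    return stalled


def mark_finished(run_dir, done_name="FINISHED"):
    """Write the (assumed) completion marker so a driver could move on."""
    (Path(run_dir) / done_name).write_text(ERROR_MARKER + "\n")
```

Calling `find_stalled_runs(".")` in the working directory would list the affected trajectories. Whether writing FINISHED by hand is enough to make pqcxms continue is something I have not verified.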
This is not a problem when a scheduler with a specific termination time is used, but it is an issue on single PCs, workstations, or NUMA nodes. The pqcxms.sh script cannot proceed and just hangs.
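On a machine without a scheduler, a per-trajectory wall-time limit could mimic the scheduler's termination time. The following is only a sketch under that assumption. The helper name `run_with_time_limit` and the default limit are hypothetical, and it only helps if the launched process itself is what blocks the driver.

```python
import subprocess


def run_with_time_limit(cmd, cwd=None, limit_s=3600.0):
    """Run cmd and return its exit code, or None if the time limit was hit.

    subprocess.run kills the child and raises TimeoutExpired once
    limit_s seconds have passed, so the caller can start the next job.
    """
    try:
        done = subprocess.run(cmd, cwd=cwd, timeout=limit_s)
    except subprocess.TimeoutExpired:
        return None
    return done.returncode
```

A driver loop could call this once per TMP.N directory and treat a `None` result as a failed trajectory instead of waiting forever.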
Cheers
Tobias
I agree, this looks like a pqcxms issue. The script itself did not change between versions (or between QCEIMS and QCxMS), so this issue might be valid for all previous versions of the program as well.
I will look into this and get back to you. Thanks for bringing this up @tobigithub !
I think this is solved, because I have not observed this issue anymore.
T.