ICAMS/lammps-user-pace

PACE is slower than SNAP in LAMMPS


Hello,

We ran both ML-PACE and ML-SNAP in LAMMPS and found that ML-SNAP is faster than ML-PACE.
We use a Si potential file for each simulation.

LAMMPS version: 2021/9/29 https://github.com/lammps/lammps/releases/tag/stable_29Sep2021
The atom file (1k.data) contains 1000 atoms.

Run snap:
# mpiexec --allow-run-as-root -np 8 /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/build/lmp -in /home/sdb/zzhen/2021/TersoffResult/ML_LAMMPS_INPUTS/run_1k_snap_linear.in

Run pace:
# mpiexec --allow-run-as-root -np 8 /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/build/lmp -in /home/sdb/zzhen/2021/TersoffResult/ML_LAMMPS_INPUTS/run_1k_pace_recursive.in

SNAP turned out to be faster than PACE, as the logs below show.

  1. SNAP run log in LAMMPS
    LAMMPS (29 Sep 2021)
    OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/lammps-stable_29Sep2021/src/comm.cpp:98)
    using 1 OpenMP thread(s) per MPI task

Initialize simulation

clear
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/lammps-stable_29Sep2021/src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
units metal
dimension 3
boundary p p p
atom_style atomic
atom_modify map array
lattice custom 1 a1 27.211677927400004 .0 .0 a2 .0 27.211677927400004 .0 a3 .0 .0 27.211677927400004 basis .0 .0 .0
Lattice spacing in x,y,z = 27.211678 27.211678 27.211678
read_data /home/sdb/zzhen/2021/TersoffResult/data/1k.data
Reading data file ...
orthogonal box = (0.0000000 0.0000000 0.0000000) to (27.211678 27.211678 27.211678)
2 by 2 by 2 MPI processor grid
reading atoms ...
1000 atoms
read_data CPU = 0.003 seconds
pair_style snap
pair_coeff * * /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_Zuo_JPCA2020.snapcoeff /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_Zuo_JPCA2020.snapparam Si
Reading potential file /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_Zuo_JPCA2020.snapcoeff with DATE: 2020-01-31
SNAP Element = Si, Radius 0.5, Weight 1
Reading potential file /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_Zuo_JPCA2020.snapparam with DATE: 2020-01-31
SNAP keyword rcutfac 4.9
SNAP keyword twojmax 8
SNAP keyword rfac0 0.99363
SNAP keyword rmin0 0
SNAP keyword quadraticflag 0
SNAP keyword bzeroflag 0

compute 1 all temp
thermo 500
thermo_style custom step temp etotal
balance 1.0 shift xy 100 1.0
Balancing ...
Neighbor list info ...
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6.9
ghost atom cutoff = 6.9
binsize = 3.45, bins = 8 8 8
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair snap, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
rebalancing time: 0.001 seconds
iteration count = 2
initial/final maximal load/proc = 126.00000 126.00000
initial/final imbalance factor = 1.0080000 1.0080000
x cuts: 0.0000000 0.50000000 1.0000000
y cuts: 0.0000000 0.50000000 1.0000000
z cuts: 0.0000000 0.50000000 1.0000000

velocity all create 300.0 498459 rot yes dist gaussian

fix 3 all nvt temp 300 300 0.01

run 1000
Per MPI rank memory allocation (min/avg/max) = 3.714 | 3.715 | 3.716 Mbytes
Step Temp TotEng
0 300 -5377.8035
500 302.41582 -5340.3568
1000 287.71641 -5341.8331
Loop time of 32.7577 on 8 procs for 1000 steps with 1000 atoms

Performance: 2.638 ns/day, 9.099 hours/ns, 30.527 timesteps/s
100.0% CPU use with 8 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Pair | 30.963 | 31.449 | 31.734 | 5.1 | 96.00
Neigh | 0 | 0 | 0 | 0.0 | 0.00
Comm | 1.0131 | 1.2982 | 1.7838 | 25.1 | 3.96
Output | 4.3493e-05 | 4.9119e-05 | 6.4329e-05 | 0.0 | 0.00
Modify | 0.0050025 | 0.0051915 | 0.0053836 | 0.2 | 0.02
Other | | 0.005709 | | | 0.02

Nlocal: 125.000 ave 126 max 124 min
Histogram: 4 0 0 0 0 0 0 0 0 4
Nghost: 1040.50 ave 1042 max 1039 min
Histogram: 4 0 0 0 0 0 0 0 0 4
Neighs: 0.00000 ave 0 max 0 min
Histogram: 8 0 0 0 0 0 0 0 0 0
FullNghs: 8750.00 ave 8820 max 8680 min
Histogram: 4 0 0 0 0 0 0 0 0 4

Total # of neighbors = 70000
Ave neighs/atom = 70.000000
Neighbor list builds = 0
Dangerous builds = 0

#write_dump all atom dump.atom.SNAP.end

SIMULATION DONE

print "All done"
All done
Total wall time: 0:00:32

  2. PACE run log in LAMMPS
    LAMMPS (29 Sep 2021)
    OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/lammps-stable_29Sep2021/src/comm.cpp:98)
    using 1 OpenMP thread(s) per MPI task

Initialize simulation

clear
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/lammps-stable_29Sep2021/src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
units metal
dimension 3
boundary p p p

atom_style atomic
atom_modify map array

lattice custom 1 a1 27.211677927400004 .0 .0 a2 .0 27.211677927400004 .0 a3 .0 .0 27.211677927400004 basis .0 .0 .0
Lattice spacing in x,y,z = 27.211678 27.211678 27.211678

read_data /home/sdb/zzhen/2021/TersoffResult/data/1k.data
Reading data file ...
orthogonal box = (0.0000000 0.0000000 0.0000000) to (27.211678 27.211678 27.211678)
2 by 2 by 2 MPI processor grid
reading atoms ...
1000 atoms
read_data CPU = 0.003 seconds
pair_style pace recursive
ACE version: 2021.4.9
Recursive evaluator is used
pair_coeff * * /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_npj_CompMat2021.ace Si
Loading /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_npj_CompMat2021.ace
Total number of basis functions
Si: 21 (r=1) 6806 (r>1)
Mapping LAMMPS atom type #1(Si) -> ACE species type #0

compute 1 all temp
thermo 500
thermo_style custom step temp etotal
balance 1.0 shift xy 100 1.0
Balancing ...
Neighbor list info ...
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8.5
ghost atom cutoff = 8.5
binsize = 4.25, bins = 7 7 7
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair pace, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
rebalancing time: 0.000 seconds
iteration count = 2
initial/final maximal load/proc = 126.00000 126.00000
initial/final imbalance factor = 1.0080000 1.0080000
x cuts: 0.0000000 0.50000000 1.0000000
y cuts: 0.0000000 0.50000000 1.0000000
z cuts: 0.0000000 0.50000000 1.0000000

velocity all create 300.0 498459 rot yes dist gaussian

fix 3 all nvt temp 300 300 0.01

run 1000
Per MPI rank memory allocation (min/avg/max) = 3.133 | 3.133 | 3.133 Mbytes
Step Temp TotEng
0 300 -163138.03
500 293.41297 -163102.72
1000 278.83199 -163101.54
Loop time of 103.233 on 8 procs for 1000 steps with 1000 atoms

Performance: 0.837 ns/day, 28.676 hours/ns, 9.687 timesteps/s
100.0% CPU use with 8 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Pair | 101.22 | 102.1 | 103.03 | 6.7 | 98.90
Neigh | 0 | 0 | 0 | 0.0 | 0.00
Comm | 0.18689 | 1.116 | 1.9982 | 63.7 | 1.08
Output | 6.2477e-05 | 6.8073e-05 | 9.5093e-05 | 0.0 | 0.00
Modify | 0.0076606 | 0.0078543 | 0.0080231 | 0.2 | 0.01
Other | | 0.008886 | | | 0.01

Nlocal: 125.000 ave 126 max 124 min
Histogram: 4 0 0 0 0 0 0 0 0 4
Nghost: 1404.50 ave 1406 max 1403 min
Histogram: 4 0 0 0 0 0 0 0 0 4
Neighs: 0.00000 ave 0 max 0 min
Histogram: 8 0 0 0 0 0 0 0 0 0
FullNghs: 15250.0 ave 15372 max 15128 min
Histogram: 4 0 0 0 0 0 0 0 0 4

Total # of neighbors = 122000
Ave neighs/atom = 122.00000
Neighbor list builds = 0
Dangerous builds = 0
#print "All done"
Total wall time: 0:01:43

run_1k_pace_recursive.in

Initialize simulation

clear
units metal
dimension 3
boundary p p p

atom_style atomic
atom_modify map array

lattice custom 1 a1 27.211677927400004 .0 .0 a2 .0 27.211677927400004 .0 a3 .0 .0 27.211677927400004 basis .0 .0 .0

read_data /home/sdb/zzhen/2021/TersoffResult/data/1k.data

Interatomic potential

#pair_style tersoff
#pair_coeff * * /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/SiCGe.tersoff Si(D)
#pair_coeff * * /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si.tersoff Si
pair_style pace recursive
pair_coeff * * /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_npj_CompMat2021.ace Si

compute 1 all temp
thermo 500
thermo_style custom step temp etotal
balance 1.0 shift xy 100 1.0

velocity all create 300.0 498459 rot yes dist gaussian

fix 3 all nvt temp 300 300 0.01

run 1000

print "Final energy pe is $(pe:%10.3f) eV"
print "Final energy per atom: $(pe/atoms:%10.3f) eV/atom"

run_1k_snap_linear.in

Initialize simulation

clear
units metal
dimension 3
boundary p p p

atom_style atomic
atom_modify map array

lattice custom 1 a1 27.211677927400004 .0 .0 a2 .0 27.211677927400004 .0 a3 .0 .0 27.211677927400004 basis .0 .0 .0

read_data /home/sdb/zzhen/2021/TersoffResult/data/1k.data
pair_style snap
pair_coeff * * /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_Zuo_JPCA2020.snapcoeff /home/sdb/zzhen/2021/lammps_src/lammps-stable_29Sep2021/potentials/Si_Zuo_JPCA2020.snapparam Si

compute 1 all temp
thermo 500
thermo_style custom step temp etotal
balance 1.0 shift xy 100 1.0

velocity all create 300.0 498459 rot yes dist gaussian

fix 3 all nvt temp 300 300 0.01

run 1000

SIMULATION DONE

print "All done"

print "Final energy pe is $(pe:%10.3f) eV"
print "Final energy per atom: $(pe/atoms:%10.3f) eV/atom"

In the abstract of the paper "Performant implementation of the atomic cluster expansion (PACE) and application to copper and silicon", you mention:
"We demonstrate that the atomic cluster expansion as implemented in PACE shifts a previously
established Pareto front for machine learning interatomic potentials toward faster and more accurate calculations"

How can we make PACE faster than SNAP? Thank you.

Thank you for the detailed report. The general answer is that it depends on the potential. A potential can be fast but less accurate, or slower but more accurate, which is why the potentials form a Pareto front. One should also distinguish between linear and non-linear ACE potentials.

In the npj paper "Performant implementation of the atomic cluster expansion (PACE) and application to copper and silicon" we demonstrate several ACE potentials. The ones shown on the Pareto front are a series of non-linear potentials (for both Cu and Si) with only up to a few hundred basis functions (see Table II and the benchmark study in the Supplemental Information of the npj paper), trained on the small dataset from another paper (Zuo, Y. et al. J. Phys. Chem. A 124, 731–745 (2020)).
The Si potential that you are testing is a different, linear potential with 6827 basis functions, trained on a much larger Si dataset, with a reported timing of 0.80 ms/atom (which agrees well with your timing of 0.824 ms/atom).

Non-linear ACE potentials achieve better accuracy with the same number of basis functions, as shown recently in Phys. Rev. Materials 6, 013804. So on the Pareto front in the npj paper, the Si ACE potentials achieve better accuracy than the other potentials at a timing of around 0.1-0.2 ms/atom (compare with your current SNAP timing of 0.256 ms/atom).
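
For anyone who wants to reproduce the ms/atom figures from their own logs, here is a minimal Python sketch. It assumes that "ms/atom" means core time per atom per timestep, i.e. loop time multiplied by the number of MPI ranks and divided by (steps × atoms). That convention is an assumption, and the helper name `core_ms_per_atom_step` is made up for this example. Applied to the two "Loop time" lines posted above, it gives values close to, but not identical with, the 0.824 and 0.256 ms/atom quoted in this thread, so those may have been derived slightly differently (possibly from the Pair section timings).

```python
import re

# Matches the LAMMPS summary line, e.g.
# "Loop time of 103.233 on 8 procs for 1000 steps with 1000 atoms"
LOOP_RE = re.compile(
    r"Loop time of\s+([\d.eE+-]+)\s+on\s+(\d+)\s+procs\s+"
    r"for\s+(\d+)\s+steps\s+with\s+(\d+)\s+atoms"
)


def core_ms_per_atom_step(log_text):
    """Return core time in ms per atom per step (hypothetical helper).

    Assumed convention: loop_time * n_procs / (n_steps * n_atoms).
    """
    match = LOOP_RE.search(log_text)
    if match is None:
        raise ValueError("no 'Loop time' line found in log text")
    loop_seconds = float(match.group(1))
    procs, steps, atoms = (int(match.group(i)) for i in (2, 3, 4))
    return loop_seconds * procs / (steps * atoms) * 1e3


# The two summary lines copied from the logs in this issue.
pace_line = "Loop time of 103.233 on 8 procs for 1000 steps with 1000 atoms"
snap_line = "Loop time of 32.7577 on 8 procs for 1000 steps with 1000 atoms"

pace_ms = core_ms_per_atom_step(pace_line)
snap_ms = core_ms_per_atom_step(snap_line)

print(f"PACE: {pace_ms:.3f} ms/atom/step")
print(f"SNAP: {snap_ms:.3f} ms/atom/step")
print(f"PACE/SNAP ratio: {pace_ms / snap_ms:.2f}")
```

Normalising by core time rather than wall time makes runs with different MPI rank counts comparable, as long as parallel efficiency is similar.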