mosdef-hub/mbuild

Speeding up add_bond.

chemicalfiend opened this issue · 11 comments

add_bond is really slow, and I think it should be sped up if mbuild is going to support larger systems.

Adding tens of thousands of bonds to a system at a time (which isn't an uncommon process) takes hours on a CPU.

I'll take a look at the speed of adding bonds; I personally haven't run into any speed issues so far, even for very large systems.

Recently I put a bit of effort into speeding up the adding of particles (and reworking some of the loader/conversion functions that add particles); see PR #1107 (basically, instead of adding compounds one at a time, add them as a list). I don't believe this revised functionality is in the current conda release yet, but it should be in the next.
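For example, the list-based pattern looks like this (a minimal sketch; the speedup comes from making one add call instead of thousands):

import mbuild as mb

system = mb.Compound()

# build the compounds first, then hand them over in a single call
compound_list = [mb.Compound(name='C') for _ in range(1000)]
system.add(compound_list)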

It's possible the issue you are encountering is actually related to the addition of particles, not the bonds, as adding particles was definitely incredibly slow as system size increased (due to composing larger and smaller graphs together). Can you post an example of a system/code that is getting bogged down?

So, over the past few days I've been testing out some add-particle and add_bond cases and timing them.

This notebook should serve to illustrate what I mean: https://github.com/chemicalfiend/lammps-carrier-mobility/blob/main/mobility/BBL-Film-Neutral-Dry/gsdwriter.ipynb . If you scroll to the bottom you can see that I've timed adding particles and adding bonds, and it takes way longer to add bonds. Granted, in this case there are about 10,000 more bonds than particles, but the scaling isn't linear: adding bonds takes hours longer, while adding particles took hardly a few seconds.

I can try the list-based approach too and compare it; I don't use the conda release of mbuild anyway.

Thank you for your response @chrisiacovella . I hope this issue helps.

Thanks for uploading the example of what you are doing specifically; that will be very helpful. I have a reasonable idea of what I think is causing the slowdown (which should be pretty easy to fix). I'll try to get some examples/workarounds put together soon.

The add_bond function in the Compound basically just calls the add_edge function of networkx (networkx handles the underlying bond_graph of the Compound data structure). So sadly, the speed seems to be a consequence of networkx and not really anything we can optimize in mbuild. We've talked a bit about potentially switching to graph-tool (which wraps C++), but that is a significant overhaul, so it's not happening at this moment.
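To make the delegation concrete, here is a rough sketch of the pattern (not mbuild's actual source, just the gist of how add_bond forwards to networkx):

import networkx as nx

class TinyCompound:
    """Toy stand-in for mb.Compound, showing the add_bond delegation."""
    def __init__(self):
        self.bond_graph = nx.Graph()

    def add_bond(self, particle_pair):
        # all the real work happens inside networkx
        self.bond_graph.add_edge(particle_pair[0], particle_pair[1])

c = TinyCompound()
c.add_bond(("particle_a", "particle_b"))
print(c.bond_graph.number_of_edges())  # 1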

However, I think your code can be substantially sped up by a small amount of restructuring. Networkx seems to get bogged down as the size of the underlying graph grows (not just in terms of the number of elements, but also the memory usage associated with it; e.g., a graph of integers will be much faster than a graph that contains all the compound information, even if they contain the same number of elements).
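Part of the cost comes from merging graphs into a growing graph one at a time (the compose step mentioned in the code comments below): each merge re-walks the accumulated graph, so the total work grows quadratically. A hedged, self-contained demonstration of just that pattern (absolute timings will vary by machine):

import time
import networkx as nx

# merging many two-node graphs into one growing graph, one at a time:
# each nx.compose call copies the accumulated graph, so the cost grows
# with the size of `big` on every iteration (quadratic overall)
start = time.time()
big = nx.Graph()
for i in range(5000):
    small = nx.Graph()
    small.add_edge(2 * i, 2 * i + 1)
    big = nx.compose(big, small)
print("one at a time:", round(time.time() - start, 3), "s")

# composing everything in a single pass touches each small graph once
start = time.time()
smalls = []
for i in range(5000):
    small = nx.Graph()
    small.add_edge(2 * i, 2 * i + 1)
    smalls.append(small)
big = nx.compose_all(smalls)
print("all at once:  ", round(time.time() - start, 3), "s")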

Basically, whether it is adding particles or adding bonds, it is slow to add them one at a time to the top-level Compound (i.e., what you've called system in your code). It is much faster to add particles/bonds to smaller groups, e.g., at the level of a molecule, and then add the smaller group (i.e., the molecule) to the system. In the code you posted, the time for adding compounds is already fast, because you build up the system by adding particles to molecules and then molecules to the overall system, which is the efficient way to do it:

for i in range(num_molecules):
    m = i + 1
    mol = mb.Compound()

    for atom in atoms:
        if atom.mol == m:
            a = mb.Particle(pos=[atom.x, atom.y, atom.z], element=atom.atomtype,
                            name=atom.atomtype, charge=atom.charge)
            mol.add(a)

    system.add(mol)

The slow part is where you loop over the bonds and add them one at a time to the system:

for i in range(len(bondi)):
    # every add_bond call here indexes into the full system
    # and grows its (large) bond graph one edge at a time
    system.add_bond((system[bondi[i] - 1], system[bondj[i] - 1]))

If you were to fold this bond step into the first loop (i.e., adding the bonds into each molecule), things should be much, much faster.
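For example, a sketch of the combined loop, reusing the variable names from the snippets above (this assumes atoms, bondi, bondj, and num_molecules are defined as in the notebook, that bondi/bondj hold 1-based global particle indices, and that both endpoints of a bond always fall in the same molecule):

import mbuild as mb

# group the bonds by molecule up front (assumes both endpoints
# of a bond belong to the same molecule)
mol_of_atom = {idx: atom.mol for idx, atom in enumerate(atoms, start=1)}
bonds_by_mol = {}
for bi, bj in zip(bondi, bondj):
    bonds_by_mol.setdefault(mol_of_atom[bi], []).append((bi, bj))

system = mb.Compound()
for i in range(num_molecules):
    m = i + 1
    mol = mb.Compound()
    particle_by_index = {}  # global 1-based index -> particle
    for idx, atom in enumerate(atoms, start=1):
        if atom.mol == m:
            a = mb.Particle(pos=[atom.x, atom.y, atom.z], element=atom.atomtype,
                            name=atom.atomtype, charge=atom.charge)
            mol.add(a)
            particle_by_index[idx] = a
    # bonds are added while the molecule's graph is still small
    for bi, bj in bonds_by_mol.get(m, []):
        mol.add_bond((particle_by_index[bi], particle_by_index[bj]))
    system.add(mol)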

To illustrate the time difference, I threw together a naive example of adding some random particles, where every two consecutive particles are bonded. This implementation should be slow:

import time

import mbuild as mb
import numpy as np

for n in [100, 1000, 5000, 10000, 20000]:

    # random particle positions
    particles = np.random.rand(n, 3)

    # bond every pair of consecutive particles (n is even here)
    bonds = [[i, i + 1] for i in range(0, n, 2)]

    # init system
    start = time.time()
    system = mb.Compound()
    compound_list = []
    for i in range(particles.shape[0]):
        compound_list.append(mb.Compound(name='C', pos=particles[i]))

    system.add(compound_list)

    # the slow part: adding bonds one at a time to the full system
    for bond in bonds:
        system.add_bond((system[bond[0]], system[bond[1]]))

    end = time.time()

    print("# particles ", system.n_particles, " n_bonds: ", system.n_bonds, " time: ", round(end - start, 3))

The timings as a function of the number of particles/bonds for this unoptimized code:

# particles  100  n_bonds:  50  time:  0.013
# particles  1000  n_bonds:  500  time:  0.396
# particles  5000  n_bonds:  2500  time:  9.005
# particles  10000  n_bonds:  5000  time:  37.45
# particles  20000  n_bonds:  10000  time:  225.86

If I quickly change this so that I initialize molecules, add particles and bonds to the molecules, and then add the molecules to the system, the time difference is pretty dramatic. In the code below I just create a temporary graph with networkx to get the connected components (i.e., the particles bonded together into molecules). Since this graph only contains integers, the operation doesn't get bogged down like the graph underlying the Compound. So even throwing ~500K bonds at it, the performance is still good. There are of course other ways to do this that may or may not be faster than the graph (including using a faster graph library).

In the code below, the time for 500K particles is less than the time for 10K in the naive approach.

# particles  100  n_bonds:  45  time:  0.236
# particles  1000  n_bonds:  495  time:  0.038
# particles  5000  n_bonds:  2495  time:  0.174
# particles  10000  n_bonds:  4995  time:  0.35
# particles  20000  n_bonds:  9995  time:  15.497
# particles  50000  n_bonds:  24995  time:  2.46
# particles  100000  n_bonds:  49995  time:  5.384
# particles  500000  n_bonds:  249995  time:  25.847
# particles  1000000  n_bonds:  499995  time:  87.045
import time

import mbuild as mb
import networkx as nx
import numpy as np

for n in [100, 1000, 5000, 10000, 20000, 50000, 100000, 500000, 1000000]:

    system = mb.Compound()

    # initialize some random particle positions
    particles = np.random.rand(n, 3)

    # create a list of bonds
    # we'll just bond 2 consecutive particles together,
    # leaving the last 10 out, just to demo how we
    # might handle unbonded particles
    bonds = [[i, i + 1] for i in range(0, n - 10, 2)]

    # start the timing!
    start = time.time()

    # create a bond graph that only contains integer edges
    # so that we can identify which particles are connected
    graph = nx.Graph()
    for bond in bonds:
        graph.add_edge(bond[0], bond[1])

    # track whether each particle has been added to the system
    status = [False] * n

    # It is much faster to pass a list of compounds to the add
    # function, rather than adding one at a time.
    # Adding one at a time gets very slow as the bond_graph of
    # the compound grows, related to the compose function in networkx.
    compound_list = []

    # loop over the connected components, i.e., molecules
    for c in nx.connected_components(graph):
        # this compound will represent our molecule
        mol = mb.Compound()

        temp = sorted(c)

        # add particles to the molecule, remembering which particle
        # corresponds to each global index so we can bond them below
        particle_of = {}
        for item in temp:
            p = mb.Compound(name='C', pos=particles[item])
            mol.add(p)
            particle_of[item] = p
            status[item] = True

        # add bonds to the molecule, mapping each integer edge
        # back to the actual particles
        for edge in graph.subgraph(c).edges:
            mol.add_bond((particle_of[edge[0]], particle_of[edge[1]]))

        # add the molecule to the compound list
        compound_list.append(mol)

    # take care of any unbonded particles
    for i, added in enumerate(status):
        if not added:
            compound_list.append(mb.Compound(name='N', pos=particles[i]))
            status[i] = True

    # add the individual compounds to the system compound
    system.add(compound_list)
    end = time.time()

    print("# particles ", system.n_particles, " n_bonds: ", system.n_bonds, " time: ", round(end - start, 3))

Thank you so much, this helps me a lot! I'll close the issue since, going by what you said, it's not an issue with mbuild but with its dependencies. Appreciate it!

Great. Let us know if you have any other issues. I've been doing some tests and found an even easier way to speed this up.

Basically, the slow part is not really networkx, but indexing into the system.children ordered set, which gets much slower as the system size grows.

So you could keep your original code and just convert the children ordered set to a list, and things should go pretty fast with much less effort.

There are reasons the code uses an ordered set internally, but I think we've added enough additional checks that a set may no longer be necessary compared to a simple list. Something for us to discuss in a future dev meeting. But for now, the solution below should be easy and fast.

for n in [100, 1000, 5000, 10000, 20000, 50000, 100000]:

    particles = np.random.rand(n, 3)
    bonds = [[i, i + 1] for i in range(0, n, 2)]

    # init system
    start = time.time()
    system = mb.Compound()
    compound_list = []
    for i in range(particles.shape[0]):
        compound_list.append(mb.Compound(name='C', pos=particles[i]))

    system.add(compound_list)

    # convert the children ordered set into a list for more efficient indexing
    particle_list = list(system.children)
    for bond in bonds:
        system.add_bond((particle_list[bond[0]], particle_list[bond[1]]))

    end = time.time()

    print("# particles ", system.n_particles, " n_bonds: ", system.n_bonds, " time: ", round(end - start, 3))
The timings for this version:

# particles  100  n_bonds:  50  time:  0.006
# particles  1000  n_bonds:  500  time:  0.075
# particles  5000  n_bonds:  2500  time:  0.102
# particles  10000  n_bonds:  5000  time:  0.173
# particles  20000  n_bonds:  10000  time:  0.365
# particles  50000  n_bonds:  25000  time:  1.04
# particles  100000  n_bonds:  50000  time:  2.582

In terms of modifying mbuild, I think I can remove the part where a compound is added to the bond graph as a node, and instead have the graph constructed solely from when bonds are added (since that is the idea of it). This would mean the bond graph does not contain any particles that don't have bonds, but that is fine because we already keep track of those via compound.particles(). This would come up in the connected-components part, and it should be easy to have that function append the particles without bonds to the list of connected components.
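A rough sketch of that idea (the function name and signature here are illustrative, not mbuild's actual API):

import networkx as nx

def connected_components_with_singletons(bond_graph, all_particles):
    """Connected components of the bond graph, plus a one-element
    component for every particle that never appears in a bond."""
    components = [set(c) for c in nx.connected_components(bond_graph)]
    bonded = set(bond_graph.nodes)
    components.extend({p} for p in all_particles if p not in bonded)
    return components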

Wow, the speedups seem dramatic, I'll try it for sure. Thank you again. I'll keep you posted if I have any other concerns.

FYI, a PR was just merged that swaps Compound.children from an OrderedSet to a list, which speeds up the entire workflow (both adding particles and adding bonds). This is not part of a release yet (although it is merged into the master branch on GitHub), so if you install from conda, you should probably still use the approach above (i.e., converting children to a list and referencing that list when adding bonds).

Thanks for raising this issue, though. I personally did not have any workflows where I added bonds after initializing all particle coordinates, so I never ran into this. This definitely helped to find a serious inefficiency.

#1121

A few technical notes to document here, so the decision making in the PR is not lost:

mBuild was previously using its own code to define an OrderedSet. This appears to be based on Hettinger's recipe, which unfortunately has O(n) indexing, hence the performance issues as system size grows. Swapping this out for ordered-set/OrderedSet vastly improved the performance of indexing into the children data structure; indexing there is O(1) because that implementation uses a standard list underneath, although that has drawbacks for other operations (see the GitHub repo for the project).

Ultimately, though, we do not necessarily need an OrderedSet to store the children. I'm sure children was initially defined as a set to avoid adding the same compound twice, but this is largely unnecessary because the code itself checks whether a compound has a parent before adding it (and will throw an error if a compound already has a parent, rather than adding it to the children; hence it will never add the same compound/particle twice).

As such, @daico007 and I agreed that a simple list would be the most efficient, as we really weren't using any of the additional functionality provided by an OrderedSet, so we didn't need the extra overhead. If we find we do need OrderedSet functionality in some future code, it would be trivial to copy the list of children into an OrderedSet, at which point we can decide between the implementation in MoSDeF and the implementation in the ordered-set project, depending on which operations we need to be most efficient (i.e., indexing vs. deletion).
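To illustrate why the indexing behavior matters, here is a hedged, self-contained sketch (LinkedOrderedSet is a toy stand-in for a linked-list-backed ordered set, not mbuild's implementation; timings will vary by machine):

import time

class LinkedOrderedSet:
    """Toy linked-list-backed container: positional access walks the
    links, so indexing is O(n), like a Hettinger-style OrderedSet."""
    def __init__(self, items):
        self._head = None
        tail = None
        for item in items:
            node = [item, None]
            if tail is None:
                self._head = node
            else:
                tail[1] = node
            tail = node

    def __getitem__(self, i):
        node = self._head
        for _ in range(i):
            node = node[1]
        return node[0]

n = 100_000
linked = LinkedOrderedSet(range(n))
plain = list(range(n))

start = time.time()
for i in range(0, n, 1000):
    _ = linked[i]   # O(n) per lookup
print("linked indexing:", round(time.time() - start, 4), "s")

start = time.time()
for i in range(0, n, 1000):
    _ = plain[i]    # O(1) per lookup
print("list indexing:  ", round(time.time() - start, 4), "s")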

All the tested speedups worked great! Can anything be done to also speed up this part:

system.save("system.gsd")

For large systems, after adding all the bonds and particles, this also takes a lot of time. Is there a way to save individual compounds one at a time, or something like that? Should I start a new issue for this? I think this may be an issue with the gsd package rather than mbuild itself, though. @chrisiacovella

Hi @chemicalfiend, this probably warrants an issue by itself; it seems like this is an issue with the gsd writer and not the compound structure itself. I can do some profiling and let you know what I find.

Alright, I'll start a new issue.