erikbrinkman/d3-dag

sugiyama layout large data would be slow

yuyang2009 opened this issue · 9 comments

the code is below, the data contains 900+ nodes and 1600+ edges. it takes nearly 5 mins to layout the dag.

const dag = d3.dagStratify()(data);
const layout = d3
          .sugiyama() 
          .layering(d3.layeringLongestPath())
          .decross(d3.decrossTwoLayer())
          .coord(d3.coordQuad())
          .nodeSize((node) => {
          const size = node ? base : 5;
          return [1.2 * size, size];
          });
const { width, height } = layout(dag);

Without having the data it's hard to say what's going on. It would probably be good to add a flag to profile the individual steps, even though this can be misleading as a poor layering can increase the time it takes to decross and compute coordinates.

However longest path and two-layer should be linear time, which means it's probably all spend in laying out the coordinates.

To try and make it faster, you can try coordCenter. That should be very fast, so if it's still taking a while, and you can share the data, I'd be happy to help debug / optimize what's happening.

If coordCenter works, it probably won't produce something nice, but it will indicate that coordQuad is causing the slowdown, which is inevitable because it's solving a quadratic program with num nodes variables or more (depending on dummy nodes and others). In this case, you can try coordGreedy. That should scale better than coordQuad, and produce better results than coordCenter. If that's too slow, then it can probably also be optimized, in which case having the data would also help me optimize that.

Without having the data it's hard to say what's going on. It would probably be good to add a flag to profile the individual steps, even though this can be misleading as a poor layering can increase the time it takes to decross and compute coordinates.

However longest path and two-layer should be linear time, which means it's probably all spend in laying out the coordinates.

To try and make it faster, you can try coordCenter. That should be very fast, so if it's still taking a while, and you can share the data, I'd be happy to help debug / optimize what's happening.

If coordCenter works, it probably won't produce something nice, but it will indicate that coordQuad is causing the slowdown, which is inevitable because it's solving a quadratic program with num nodes variables or more (depending on dummy nodes and others). In this case, you can try coordGreedy. That should scale better than coordQuad, and produce better results than coordCenter. If that's too slow, then it can probably also be optimized, in which case having the data would also help me optimize that.

Ths for reply. The coordQuad option is works to me. It takes about 1s to generate the layout by the way.

@yuyang2009 what changed to bring it from 5 minutes to 1 second?

@erikbrinkman I just change the 'coord' option from 'd3.coordQuad()' to 'd3.coordGreedy()'

Erik: If you're interested in a dataset to track this down with, I've run into something that looks related to me. See observablehq.com/d/32c3b2a8612fda16, which is copied from your basic example, but with some tweaks and using my data. If you change the layout to Sugiyama and start cranking up the percentage of the dataset you use it gets slow pretty quickly; 20% -> 17.1 ms, 40% -> 666.6ms, 60% -> 6873.1 ms, and I didn't go above that since I've gotten Observable to hang on me with the larger dataset.

(I have some ideas for using d3-dag for layout, but in an interactive fashion, which means I'm very interested in quick performance. I'll try your suggestion about coordCenter next, but I'll leave that notebook alone so you can poke at it if you want.)

@curiouserrandy In this case, the problem is the coordTopological which in its default implementation is doing something like coordQuad which is slow on large datasets because it's solving a quadratic optimization. If you want to use a sugiyama style topological layout, I can add a more efficient topological sugiyama layout, but it probably only makes sense if that's what you want. If you want to pursue this route, it might be good to open up a feature request for a fast sugiyama topological coordinate assignment operator. For example this is an example notebook where I tweaked your example to use rough node sizes (so the names fit), and a faster topological coordinate system that easily renders the full data, but not super pleasingly. Also note in this example I didn't tweak the arrows to work with ellipse nodes

On a side note, I am generally interested in adapting the library for interactive usage, although I'm not entirely sure what the best approach is. If you're willing to debug or try things, I'd be willing to work on tweaks or api changes that would enable faster interactive dag renderings. If you want to go this route, I would open up an issue about interactive layouts. I expect getting this right night require a non-trivial amount of effort on your part, but I'm willing to do what I can.

@erikbrinkman : Thanks very much for the response and the tweaked notebook! I'm not particularly knowledgeable or biased about specific layout algorithms--I was aiming for Sugiyama because I know it's what Graphviz uses, and I've historically liked the layouts that program gives. Having said that, I haven't been completely happy with d3-dag layouts yet (top of mind is the fact that all my nodes are being ordered in the y direction rather than there being any horizontal parallelism but there may be other things; I've been worrying about getting things working before worrying about aesthetics), so probably the right thing is for me to play around with layout algorithms and decide what I want from a visual perspective, and then re-raise a flag if the performance isn't good along that path.

WRT interactive usage, I'd be happy to try things out and debug them, though I'm sure I'll have some learning curves to climb (specifically around layout algorithms and typescript, which I don't (yet) know). But this is an area I want to be putting some effort into, and helping get d3-dag working interactively is very much aligned with where I want to head. Before I open the interactivity issue, could you give me a sense as to what work you think needs to happen in that space? I had been naively assuming that if we got good performance, d3 animation and updating would do the rest.

@curiouserrandy For your first major issue, that's because you tweaked an observable notebook centered around topological layouts, which in this case, means that the nodes are vertically ordered. This notebook is centered around the gamut of sugiyama layouts, and probably produces something much closer to what you're interested in.

Unfortunately this isn't intended to be a replacement for graph-vis. Like d3, it's some assembly required, with the intention that that bring customizability, so yeah there should be a steeper learning curve, but the default parameters of sugiyama, with appropriate node sizes should work reasonably well.

I agree it makes sense to tackle interactivity after you're able to generate a layout that you like. It may work to just use d3-animation and repeatedly run the layout code. However, there are a few aspects that could make performance better or worse. If you're dynamically adding or removing nodes, the dag structure doesn't support that well, so you'll be wasting some computation to regenerate it every time (or hacking into the internal protected api). If you're just hiding or unhiding nodes, then there may be an even more performant way to regenerate, as long as you have well defined semantics, but figuring out what those are / how to apply them might not be obvious. I forget how d3-hierarchy is meant to handle expandable trees in this case, but it might be a good place to start. Sugiyama doesn't guarantee any stability, so you may be able to animate it, but it may not look as pretty. Aspects of the layout could potentially be tweaks to add stability, or maybe it's not worth it, but it's something to consider. Finally, if the use case is expanding nodes one at a time, it may make more sense to have some meta-layout that allows stably adding nodes. This might allow for using the mor computationally expensive layouts, because they're amortized over each node in user time. None of these is particularly necessary, but they're all things I have thought about, but without grounding to implement.

@erikbrinkman : A quick note, just so that you don't think I've gone radio silent on you :-}. Thanks for the pointer around toplogical layouts--I hadn't done any real investigation of the different options, just cloned a notebook and was playing with it. You're quite right that I want to use one of the more standard layouts.

I remain very interested in working with you on improving d3-dag for interactive use. I haven't filed an interactivity issue yet since I'd like to get to a point where I'm having problems doing something I want to do to inform that issue, but I suspect that'll happen sometime in the next week or two :-}. But if you have changes you'd like me to look at/play with/debug from your end, I'm also happy to do that.

And yes, I'm completely down with "some assembly required"; that's why I want to work in d3 in the first place. There may be places where the underlying functionality doesn't match GraphViz that I may want to engage around (e.g. automatically handling cycles, multiple edges from the same source/dest pair) but I'll wait until they're more clearly in my way before raising them.

Thanks very much for all the help!