Tracktion/tracktion_engine

[Possible optimisation needed]: GUI freezes and clicks and pops when making structural changes.

Closed this issue · 26 comments

jdieks commented

Detailed steps on how to reproduce the issue

  • Create a big project, e.g. 40-64 tracks with some effects and some routing.
  • Make sure you add about 32 or more aux return tracks.
  • Not all tracks have to go to an output; only the size of the edit matters.
  • Make sure that only 4 or so inputs actually carry audio and that the output is stereo, for easy testing.
  • Get some live audio into Tracktion Engine (playback probably also works).

You will notice that every change you make (adding FX, changing sends, selecting inputs etc.) on any channel, used or unused, causes clicks and pops on the output. For an app used live this is a difficult thing to deal with.
We have now done some profiling in the engine too; a report is attached in a comment below.

This behaviour can also be triggered in Waveform. The key is to have about 32 channels or so with an aux return in the project (among tracks with regular plugins) and/or some rackFX that you connect to multiple tracks while monitoring live. When you then add more aux returns to the project or add more tracks to the rackFX, it starts popping on the ASIO driver.

Interestingly, when we get audio in and out of the graph using the official NDI plugin, these pops are not present on the NDI stream. This could be a coincidence, due to extra buffering, or because the NDI tracks are processed before the underrun happens.

What is the expected behaviour?

The expected behaviour would be that the new graph is calculated (the GUI might even freeze), but that we don't hear clicks and pops in the audio output.

Operating systems

Windows

What versions of the operating systems?

We have tested this on both Windows 10 and 11.
We have also tested this with many flavours of ASIO drivers (RME, Dante, ReaRoute); all drivers show the same behaviour.

Architectures

x86_64

Stacktrace

No response

Plug-in formats (if applicable)

No response

Plug-in host applications (DAWs) (if applicable)

We use Reaper for the other end, where we are sending and receiving the NDI to/from.

Testing on the develop branch

The bug is present on the develop branch

Code of Conduct

  • I agree to follow the Code of Conduct
jdieks commented

We have now done some profiling too and created a project with 192 tracks and 2 aux returns.
Doing this freezes the GUI for 2-3 whole seconds when a graph rebuild is triggered, while the CPU meter reports only 8% usage. We also get clicks and pops in the audio.
Most of the time is spent in "tracktion::graph::ReturnNode::findSendNodes".

return_node_hotspot.txt

*Report also updated to reflect the latest findings.

Ok, thanks for the profile, that makes the ReturnNode::findSendNodes situation a lot clearer. I have an idea of how to optimise that: I've already prepared some code to order the Nodes for the Node::initialise call, so I'll just need to pass the same ordered collection to transform and make use of it there.

Can you provide me with the programmatic code for your example Edit? I want to add it to our benchmarks so I can check that what I'm doing actually improves things.


One thing I don't quite understand, however, is how this GUI freeze (i.e. the time taken to call transform) is causing audio dropouts. This path gets called via prepareToPlay, which should run concurrently with the audio callback for the old graph.

Can I check that you're on the tip of develop?
If so, does your profile reveal if the audio callback is blocked during the transform call?

Hi @drowaudio! I'm working with @jdieks on the project described in this issue.

I set up a simple project (using the tip of develop) that demonstrates the problem and can be used as a benchmark: repo

It creates an Edit, adds 200 audio tracks, inserts one instance of a given plugin into each track, then measures how much time it takes to update the playback graph. On my machine it produces the following output (measurements in seconds):

[benchmark] volume = 0.044774
[benchmark] auxsend = 0.0430954
[benchmark] auxreturn = 0.223793
[benchmark] patchbay = 0.16873

As you can see, the graph update takes significantly longer with AuxReturn and PatchBay plugins.
Modifying the code to add two instances of each plugin instead of one makes the difference more obvious:

[benchmark] volume = 0.0569204
[benchmark] auxsend = 0.0498455
[benchmark] auxreturn = 0.492667
[benchmark] patchbay = 0.519134

I have not yet looked into how exactly this interacts with the audio thread. Maybe the audio thread starts missing deadlines because of the overall increased CPU load (playing the old graph while doing a lot of work to build the new one)?
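
For reference, this is roughly the shape of that benchmark, sketched against the public tracktion_engine API. It is an approximation of what the linked repo does rather than its actual code; the plugin-creation and timing details are assumptions.

#include <chrono>
#include <tracktion_engine/tracktion_engine.h>

namespace te = tracktion::engine;

// Builds a 200-track Edit, inserts one plugin of the given type on each track, then
// times how long the playback-graph (re)build takes. A real console app would also
// need a juce::ScopedJuceInitialiser_GUI and a running message loop for the Engine.
static double timeGraphRebuild (const juce::String& pluginType)
{
    te::Engine engine { "GraphRebuildBenchmark" };
    auto edit = te::Edit::createSingleTrackEdit (engine);
    edit->ensureNumberOfAudioTracks (200);

    for (auto track : te::getAudioTracks (*edit))
        if (auto plugin = edit->getPluginCache().createNewPlugin (pluginType, {}))
            track->pluginList.insertPlugin (plugin, 0, nullptr);

    auto start = std::chrono::steady_clock::now();
    edit->getTransport().ensureContextAllocated();
    return std::chrono::duration<double> (std::chrono::steady_clock::now() - start).count();
}

// Usage, e.g.: timeGraphRebuild (te::AuxReturnPlugin::xmlTypeName)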

jdieks commented

@drowaudio If needed, we could also create a DAWbench-style project in Waveform with some tracks that you can duplicate until you notice the clicks/pops problem on your local computer in Waveform itself.
Just let me know if you think that's a good idea.

I think what you've provided in that repo will be perfect for testing this. I'll convert it to one of our Benchmarks and then work on the optimisation.

If that doesn't get rid of the pops, perhaps providing a Waveform project that exhibits the problem would help, but I'll try to get to this early next week.

Ok, I think I've optimised these a fair bit so they're in line with the other plugins now. Can you grab the tip of develop and see if the changes have made any difference to 1) the time to prepare the graph and 2) the possible clicks and pops?

By my benchmark, auxreturn should be ~10x faster and patchbay ~6x faster.

If you still have noticeable pauses we might need to widen the profile so let me know if this is the case and we can go from there.

jdieks commented

Thanks for the quick response. We are going to start testing this either at the end of today or tomorrow.

jdieks commented

Hi Dave,

OK, we are seeing more or less the same results as you: roughly a 500% improvement.
We indeed get a much better response from the GUI and also from a clicks-and-pops perspective.
The clicking and popping is a lot better, but it's not gone yet. I know the following is not scientific, but when I have around 120 stereo channels with basic effects turned on (EQ, compressor, limiter, plus some auxes between them), I see about 30% general CPU usage, which is really good. When I then change the input of one channel and force a graph rebuild, I still get a click and the CPU usage peaks to 100% on one or two cores.
Daniel is looking into this to give you a more scientific report.

[screenshot: CPU usage graph]
The peaks are when I change an input, and the ones that reach 100% are the ones that seem to cause clicks and pops.
I've now seen that the clicks and pops do occur at points of higher CPU usage, but not necessarily at the points where it reaches 100%; sometimes earlier, at least according to the meter (I'm not sure if the Windows meters show peak or average).

Ok, it's good to know the graph rebuild has improved. The next thing to find out is whether the graph rebuild is actually blocking the audio thread (which it definitely shouldn't). A profile should show what's happening if you filter by the start/end of the graph rebuild: is the audio thread still running during that period? (You could probably also just check by pausing the debugger.)

Also, is it more obvious with smaller buffer sizes and does it go away with larger ones?

Hi Dave!

The audio thread does not seem to be blocked during the graph rebuild; the swap happens as I would expect.
For now, I was mostly looking to understand why the UI is blocked for so long during the graph update. Here is what I have found:

Plugin::getNumOutputChannelsGivenInputs(int) is being called a lot, often multiple times for any given plugin instance. The default implementation of this will allocate (and then free) a bunch of strings on the heap, just to count the number of output channel names. Implementing getNumOutputChannelsGivenInputs in my classes derived from Plugin in a way that does not use the channel names gave me significant improvements.

Do you think similar improvements are possible in the tracktion plugins that call this? In the case of PatchBayPlugin, returning 2 instead of relying on the channel names gives me an approximately 2x speedup when measuring TransportControl::ensureContextAllocated. This may not be the correct thing to do for the patchbay, but I think it shows that optimizations might be possible here.
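
As a small illustration of the kind of override described above (a fragment only: MyFixedStereoPlugin is a hypothetical application plugin, and the constructor and the other required Plugin overrides are omitted):

// Returning a known channel count directly sidesteps the default implementation,
// which allocates the output channel name strings on the heap just to count them.
class MyFixedStereoPlugin : public tracktion::engine::Plugin
{
public:
    int getNumOutputChannelsGivenInputs (int /*numInputChannels*/) override
    {
        return 2;  // always stereo out, so no channel names need to be built
    }

    // ... remaining Plugin interface omitted ...
};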

After making sure Plugin::getChannelNames is no longer called during the graph update, I looked into what is going on in the remaining time. Most of the remaining work seems to be split roughly 50-50 between two functions:

+ tracktion::graph::node_player_utils::prepareToPlay (33.46%) | 196 (0.08%)
| - tracktion::engine::LevelMeasurerProcessingNode::prepareToPlay (14.27%) | 4 (0.00%)
| - tracktion::graph::createNodeGraph (13.10%) | 17 (0.01%)
| + tracktion::graph::ReturnNode::transform | 25048 (9.62%) | 5 (0.00%)
| | + tracktion::graph::ReturnNode::findSendNodes | 25043 (9.62%) | 623 (0.24%)
| | | - [External Code] | 23968 (9.21%) | 23968 (9.21%)
  1. LevelMeasurerProcessingNode::prepareToPlay calls getNodeProperties on various different nodes, which in turn call getNodeProperties on other nodes. This goes many levels down, so it's difficult to tell, but I could not identify any obvious hotspots here.

  2. Most of the createNodeGraph time seems to be spent in ReturnNode::findSendNodes calling into External Code. Maybe this refers to dynamic_cast? At least that is the line highlighted in the VS profiler. In my case there are approximately 4000 nodes in the postOrderedNodes vector, most of which are not SendNodes (see the sketch of the presumed loop below).
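
To make that cost concrete, the lookup presumably has roughly this shape (an assumption based on the profile and the discussion below, not the actual ReturnNode source); it runs once per ReturnNode and dynamic_casts every node in the graph:

// Presumed shape of the current lookup: one dynamic_cast per node, per ReturnNode.
std::vector<SendNode*> sends;

for (auto n : postOrderedNodes)                   // ~4000 nodes in this Edit
    if (auto send = dynamic_cast<SendNode*> (n))  // this cast shows up as "External Code"
        if (send->getBusID() == busID)
            sends.push_back (send);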

Again, I'm not sure if any optimizations are feasible here; what do you think?

For these measurements I have used an Edit that only works with the application I'm working on. If you need more isolated examples that do not depend on anything outside the tracktion codebase, please let me know and I will try to come up with something.

Can I just double check this is what you're seeing after my changes above?

If so, it would be helpful to construct an additional benchmark (in the same UnitTest class I linked to above) that more closely represents your use case. A lot of these issues are about complexity scaling, so I need real-world examples.

I will look in to LevelMeasurerProcessingNode::prepareToPlay to see if it can be optimised.

I'm a bit surprised that iterating the ordered nodes looking for sends is so expensive, but I guess if that runs 100 times over 1000 nodes it could add up.

The problem here is that the function is generic and doesn't know about "send node" so I'm not sure how that could be cached nicely. I'll have a think...

Can I just double check this is what you're seeing after my changes above?

Yes, this is based on the tip of develop.

I will look into creating a benchmark that matches my use case as closely as possible.

I think I have managed to set up a benchmark that closely matches what I'm seeing in the real-world use case described in my reply above. Some key observations:

  • Making sure the PatchBays do not call getChannelNames (by returning 2 from getNumOutputChannelsGivenInputs) yields a roughly 2x speedup.
  • About 20% of the remaining time seems to be spent in ReturnNode::findSendNodes calling into __RTDynamicCast.
  • Sending all sends to a common Return seems to be around 3x slower compared to each send going to a different return. (Maybe I can set things up differently to avoid this.)

The code for the benchmark can be found here:
daniel-gergely@b83af69

Thanks, this really helps.
I'm just about to add the benchmark to our CI so we can establish a baseline. Can I just check these two lines with you?
returnPlugin->busNumber.setValue (trackIndex == 0 ? commonBusNumber : trackIndex, nullptr);
Is that right? Why not just use trackIndex the whole time here?

And then with this line:
sendPlugin->busNumber.setValue (commonBusNumber, nullptr);
It doesn't seem to make sense that all the returns are set up for separate busses but the sends are all to a common bus.
I think having them all on a common bus or the track index as a bus would make more sense, at least for this benchmark?


Then I think I have some easy wins:

  • 50% of the time is effectively spent in LevelMeasurerProcessingNode::prepareToPlay, and almost all of it goes to finding the latency at the root node and then the current node. After transform has finished, the latency shouldn't change, so we should cache it: first in the root node (which is a SummingNode I think), which accounts for almost all of the CPU here, but there are probably other places too. (See a sketch of this idea after the list.)
  • For the dynamic_cast cost, I need to figure out an appropriate place to cache the types of Nodes to reduce the search space and avoid the casting. This will need a bit more thought.
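
A minimal sketch of that latency-caching idea, assuming a SummingNode-like class. SummingNodeSketch, the cachedNodeProperties member and the aggregation details are illustrative, not the actual tracktion::graph implementation:

#include <algorithm>
#include <optional>

class SummingNodeSketch : public tracktion::graph::Node
{
public:
    tracktion::graph::NodeProperties getNodeProperties() override
    {
        // Once transform has finished, the inputs (and therefore the latency) can no
        // longer change, so compute the aggregate properties once and reuse them.
        if (cachedNodeProperties)
            return *cachedNodeProperties;

        tracktion::graph::NodeProperties props;

        for (auto input : getDirectInputNodes())
        {
            auto inputProps = input->getNodeProperties();
            props.hasAudio = props.hasAudio || inputProps.hasAudio;
            props.hasMidi  = props.hasMidi  || inputProps.hasMidi;
            props.numberOfChannels  = std::max (props.numberOfChannels,  inputProps.numberOfChannels);
            props.latencyNumSamples = std::max (props.latencyNumSamples, inputProps.latencyNumSamples);
        }

        cachedNodeProperties = props;
        return props;
    }

private:
    // The rest of the Node interface (process, isReadyToProcess, ownership of the input
    // nodes, etc.) is omitted; a real implementation would also make sure caching only
    // starts once the graph has been fully transformed.
    std::optional<tracktion::graph::NodeProperties> cachedNodeProperties;
};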

I think the part about the bus numbers could be simplified further: let's just give each AuxReturn its own bus based on trackIndex:

returnPlugin->busNumber.setValue(trackIndex, nullptr);

The interesting thing is that there is a big difference in the benchmark time depending on how the send plugins are set up.

In the second case below, with all sends going to the same return, it would make sense to simply not add all the other AuxReturns, but I left those in just to keep the number of nodes the same for the benchmark.

sendPlugin->busNumber.setValue(trackIndex, nullptr);

// Sending all sends to the same return takes approximately 3x more time than sending each to a different return:
sendPlugin->busNumber.setValue(0, nullptr);

To give you a real-world example of why one might want to have so many sends going to a single aux return: I have a dedicated audio track for monitoring signals coming from other tracks, from different points in the signal chain (e.g. pre- or post-fader). Adding an AuxSend to each track, each feeding to a common AuxReturn seemed like a straightforward way to implement this.

Then I think I have some easy wins:

Sounds good, thanks for looking into this!

jdieks commented

Hey @drowaudio, just a quick question: should we test the current pull request already, or do you think it will get more changes soon?

Thanks for the info!

Hi, I think I've got an improvement for the send node lookup. I'll try and get to it at the start of next week, but if I can't it will probably be the start of July, as I've got a conference this week.

jdieks commented

OK thanks for the update. Looking forward to it! :)

Ok, I had a chance to look at this today and have cached the lookups so they should be effectively free past the first pass now. Can you test with this new version and let me know if it improves things?

If you're still getting glitches, a fresh profile trace would be helpful.
Hope it helps though 🤞

jdieks commented

Oh great, we will test it Monday first thing! We'll let you know how it goes!

I did some testing using commit 867a044b5c2f8dbe2c8be847a7b4bce54a4dbec2 (tip of feature/plugin_optimisations). I have not taken any detailed measurements yet, but the graph update time seems to have improved.

However, I wonder if the caching of send nodes in ReturnNode::findSendNodes is correct. After updating to the commit mentioned above, sends no longer work (no output from the AuxReturn plugin). A possible explanation is that the cache passed to findSendNodes is the same TransformCache instance shared by all nodes (see transformNodes in tracktion_Node.h), but findSendNodes only caches the send nodes for a specific bus the first time it is called.

Reverting the latest commit or disabling the cache in findSendNodes (leaving only the else branch to search through the nodes each time) seems to fix the problem.

Caching all send nodes, and filtering by busID each time after retrieving them from the cache, also seems to work. Of course, there might be better solutions; let me know what you think!

// Cache hit: the cache holds every SendNode in the graph, so just filter by bus ID.
if (auto cachedVec = cache.getCachedProperty<std::vector<SendNode*>> (cacheKey))
{
    for (auto send : *cachedVec)
        if (send->getBusID() == busID)
            sends.push_back (send);
}
else
{
    // Cache miss: do the full dynamic_cast scan once, collecting the sends for this
    // bus and caching the complete list of SendNodes for subsequent calls.
    std::vector<SendNode*> allSendNodes;

    for (auto n : postOrderedNodes)
        if (auto send = dynamic_cast<SendNode*> (n))
        {
            if (send->getBusID() == busID)
                sends.push_back (send);

            allSendNodes.push_back (send);
        }

    cache.cacheProperty (cacheKey, allSendNodes);
}

Oh yes, sorry, that's my mistake. The caching process shouldn't check the bus number; it should cache all send nodes.
I think that will be fine.

The other option would be to cache each bus ID separately but that's probably overkill.

Does the above fix also stop the audible glitches you were hearing?

Ok, I've updated the branch/PR. Can you grab the latest and check that works for you please?

jdieks commented

Initial testing looks really good, and on our machines the audio clicks do indeed seem to be solved!
We are now giving this to our testers to see how it goes on their side and will let you know. It will probably be about a week before I have more news.

jdieks commented

Hi Dave,

I just wanted to tell you that our testers have still not been able to test this in a "production environment" yet.
I will let you know ASAP how that test goes. Sorry for the delay...

jdieks commented

Hi Dave, sorry that it took so long. Our testers finally came back to us and confirmed that this issue is indeed fixed. Good news!

Thanks for your help!