cmajor-lang/cmajor

Graphs with 1000+ nodes start to become infeasible

Closed this issue · 6 comments

In testing graphs of various sizes, I observed more-or-less the following timings (2021 MacBook M1 Pro):

(x-axis = number of nodes, y-axis = number of seconds to compile*)

* or more specifically, the time it takes for engine.link() to return

(Screenshot 2024-03-17 at 10 02 36 — timing chart, image not included)

Thanks Marcus - we should profile it and see if there's an obvious hotspot.

What's going on here is that it's duplicating the processor for each of the passthrough instances in the graph, and then compiling this code that many times, and then attempting to inline it all. I'll look through this and see why the duplication is happening - I think we should be able to avoid this.

Interestingly, the optimisation pass is succeeding in turning the result into a sensible bit of code. Here's the output for 1000 nodes (100 parallel branches of 10 nodes):

	.build_version macos, 14, 0
	.globl	_initialise                     ; -- Begin function initialise
	.p2align	2
_initialise:                            ; @initialise
; %bb.0:
	ret
                                        ; -- End function
	.globl	_advanceBlock                   ; -- Begin function advanceBlock
	.p2align	2
_advanceBlock:                          ; @advanceBlock
; %bb.0:
	ldr	w8, [x0]
	cmp	w8, w2
	b.eq	LBB1_3
; %bb.1:                                ; %.lr.ph.preheader
	mov	w9, #100
LBB1_2:                                 ; %.lr.ph
                                        ; =>This Inner Loop Header: Depth=1
	add	x8, x1, w8, sxtw #2
	ldr	w10, [x8]
	mul	w10, w10, w9
	str	w10, [x8, #4096]
	ldr	w8, [x0]
	add	w8, w8, #1
	str	w8, [x0]
	cmp	w8, w2
	b.ne	LBB1_2
LBB1_3:                                 ; %._crit_edge
	str	wzr, [x0]
	ret
                                        ; -- End function

It's managed to work out that this is basically a *100...

Yeah, the resulting optimised graph is mad impressive. It's very fast - after waiting an hour for it to compile, that is ;)

It turns out it's spending significant time in an O(N^2) algorithm to make names unique, and there are lots of similarly-named functions, one per instance of this processor. So, try dropping this into cmaj_AST_Utilities.h to replace UniqueNameList:

// Maps objects to unique names. A per-root suffix counter remembers the
// next candidate suffix, so the linear scan over existing names is only
// needed around the first collision for each root, not on every lookup.
template <typename ObjectType, typename ParentType>
struct UniqueNameList
{
    UniqueNameList() = default;
    UniqueNameList (const UniqueNameList&) = delete;

    std::string getName (const ObjectType& o)
    {
        auto& name = names[std::addressof (o)];

        if (name.empty())
        {
            auto root = static_cast<ParentType&> (*this).getRootName (o);

            if (root.empty())
                root = "_";

            auto exists = [this] (const std::string& nameToCheck) -> bool
            {
                for (auto& n : names)
                    if (n.second == nameToCheck)
                        return true;

                return false;
            };

            auto uniqueName = root;
            auto& suffix = suffixes[root];

            // A non-zero suffix means this root has collided before, so
            // skip straight to the next numbered candidate.
            if (suffix != 0)
                uniqueName = root + "_" + std::to_string (suffix++);

            // Thanks to the counter, this loop rarely needs more than
            // one pass over the existing names.
            while (exists (uniqueName))
                uniqueName = root + "_" + std::to_string (suffix++);

            name = uniqueName;
        }

        return name;
    }

    void clear()
    {
        names.clear();
    }

    std::unordered_map<const ObjectType*, std::string> names;
    std::unordered_map<std::string, uint32_t> suffixes;
};

This takes the runtime for 4000 nodes (500/8) from 275 secs to 9 secs on my machine, and 10,000 nodes (500/20) now takes 65 secs with this algorithm.

I'll see about getting a fix along these lines into the codebase.

I've pushed another performance fix: the 500/8 case is now down to 1.8 secs, and the 10,000-node case to 9 seconds on my machine. I think that's more reasonable given the size of the graph. The next issue to resolve is to remove the duplication of the code for this situation.