faasm/faabric

Potential scheduler bug? nextMsgIdx not updated when scheduling to other hosts

Closed this issue · 1 comments

I'm adapting my compute/storage division-aware scheduler patches for faasm-ndp to the new code, and I noticed that in the blocks of code responsible for scheduling off to other nodes when the current one is busy, the nextMsgIndex used for calculations later is not updated - it was updated in the for loop stepping above, maybe it was missed in a copy-paste?

https://github.com/faasm/faabric/blob/master/src/scheduler/Scheduler.cpp#L287-L391

for (; nextMsgIdx < nLocally; nextMsgIdx++) { // for loop - nextMsgIdx updated
...
remainder -= nOnThisHost; // but nextMsgIdx not updated

From a code review standpoint it seems that there are multiple variables manually tracked: nextMsgIdx, remainder (which should just always be total minus nextMsgIdx), and the actual messages vector. The vector is not mutated, so that's not a problem, but I think replacing either nextMsgIdx or remainder with a formula based on the other one (you could put it in a capture-by-reference local lambda to avoid duplication) would eliminate this bug and also prevent making similar mistakes. Also marking locals like nMessages, isThreads, cores as const would make it clear these are not changing in the function.

const auto remainder = [&](){ return nMessages - nextMsgIdx; };
if (remainder() > 0) {...}

or

if (nextMsgIdx < nMessages) {...} // without remainder

Hmm yes, you're probably right on that one, this iteration of the scheduler is yet to be tested in anger so there are probably a few more issues with it that are yet to be discovered. Although the external interface won't change, the internals of it will get rewritten at least once over the next two or three weeks, so I don't want to spend too much time perfecting this specific implementation. Once the design and logic is stable, I'll certainly incorporate your comments on style and structure as they are very valid. We're running a few experiments this week that should flush out that bug and any others too.