arporter/habakkuk

Allow for overlap of DIVSD with other operations

arporter opened this issue · 7 comments

It seems that a DIVSD does not exclusively occupy the execution port - i.e. it's possible for an independent MULSD to make use of the multiplication hardware while a DIVSD is in progress.
Unfortunately this doesn't appear to be documented anywhere. We therefore need to be able to produce a performance estimate that allows for this overlapping, maybe with bounds since we don't know just how much the two operations can be overlapped.

In conjunction with this, Agner's published results for 1/throughput of a DIVSD have a range of 8-14 cycles. Currently habakkuk produces a performance estimate using a single value for this throughput but again, we could do with producing bounds on the estimate.

I'm extending this issue to cope with overlapping of ADDSDs with DIVSDs as well.

dag.generate_schedule() does what it says and can probably be left alone. I suspect what is needed is to inspect the generated schedule and identify cases where independent divisions/multiplications are consecutively scheduled on the execution port.

schedule_cost() now correctly allows for the overlap of multiplications with division. However, I'm not sure that having the extra functionality in that routine is the best way of doing this. Requires further thought.

Have created a new Schedule class to encapsulate functionality that was shared between methods in the DirectedAcyclicGraph class. However, I'm still not happy as constructing the schedule modifies the state of the DAG (by flagging nodes as they are executed). Maybe the DAG should have an instance of Schedule as part of its internal state? It then does not break encapsulation if the construction of that Schedule modifies the state of the DAG.

Have extended code so that it will overlap + and - operations with DIVSDs as well. Currently there's no limit to how many it will overlap which is obviously not realistic.

Limits dictionary added. I've guessed at the values to put in it - need to run more microbenchmarks.
I also need a new set of microbenchmarks containing a mixture of both addition and multiplication ops alongside a division. Currently I assume they can all be overlapped but this might not be true.

If I get habakkuk to unroll a loop then, even with the new limits on overlapping, it still overlaps most of the second iteration with the first and thus gets an unrealistically high performance estimate. Presumably the number of operations that can be considered for overlap will be limited by the instruction cache on the core.

PR #35 has been merged to master. Closing issue.