under-Peter/OMEinsum.jl

Optimal Contraction Order

under-Peter opened this issue · 9 comments

Current implementation in einsum-design #25

Input:

  • tuple of tensors xs
  • tuple of index-label tuples ixs, where ixs[i] labels the indices of tensor xs[i]
  • tuple of index-labels iy

Output:

  • A list of operations that, applied sequentially, is optimal under the following constraints:
    1. Outer products between tensors that don't share labels are applied at the end
    2. Permutations from intermediate indices to iy are done at the end
    3. Permutations and Outer products don't enter the cost calculation - they are assumed to be unavoidable at the end anyway
    4. The cost is calculated as the number of iterations a loop-based approach to the operation would have to execute (see the sketch after this list)
    5. Certain operations can be combined, e.g. traces on the same tensor, or Diagonals of two tensors
    6. If multiple sequences have the same cost, ones with fewer operations are preferred
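
For constraint 4, a minimal sketch of the cost model (opcost and sizes are illustrative names, not part of the package): the number of loop iterations is the product of the dimensions of the distinct labels in an operation.

# cost of one operation = product of the dimensions of its distinct labels
opcost(labels, sizes) = prod(sizes[l] for l in unique(labels))

sizes = Dict(:i => 10, :j => 10, :k => 10)
opcost((:i, :j, :j, :k), sizes)  # matrix product ij,jk -> ik: 10^3 iterations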

The algorithm is as follows:

edges = edgesfromindices(ixs, iy)  # get all indices that are reduced or modified
mincost = typemax(Int)             # best cost found so far
opt = nothing                      # best operator sequence found so far
for perm in allpermutations(edges)
    ops = operatorsfromedges(perm, ixs, iy)  # the list of operators this permutation corresponds to
    cost = costofops(ops, xs)                # cost of evaluating the operations ops in sequence
    if cost < mincost                        # lowest cost so far: take it as the optimum
        mincost = cost
        opt = ops
    elseif cost == mincost && length(ops) < length(opt)
        opt = ops                            # equal cost: prefer the shorter operator sequence
    end
end

Design decisions

Why decide the operators?

The operators have to be fixed to calculate the cost, since combining operations, e.g. evaluating two traces at once, can be cheaper than calculating them one after the other.
Calculating the operators allows us to combine certain operations, but we want to control which ones:
if we can combine two optimized operations like TensorContract (which uses BLAS under the hood), we want to do that, but we don't want to combine a TensorContract with an IndexReduction if it means that instead of calling a fast tensorcontract and sum we need the generic fallback.
Thus, even if we test all possible groupings, we can't properly decide on an order without considering which operations we want to combine and which to keep separate. The extreme case would be to only ever use the generic fallback, where performance would suffer.
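
As a small plain-Julia illustration of the two-traces example (illustrative only): the combined evaluation of iijj -> is a single d^2 loop, while tracing one pair at a time costs d^3 + d.

d = 5
A = randn(d, d, d, d)

# combined "iijj -> scalar": a single d^2 loop
t1 = sum(A[i, i, j, j] for i in 1:d, j in 1:d)

# sequential: first "iijk -> jk" (a d^3 loop), then "jj -> scalar" (a d loop)
B = [sum(A[i, i, j, k] for i in 1:d) for j in 1:d, k in 1:d]
t2 = sum(B[j, j] for j in 1:d)

t1 ≈ t2  # true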

Why ignore the permutation at the end?

I assume that pretty much every operation can be merged with Permutation. E.g. a partial trace and a Permutation can be combined by simply permuting the output-index argument in tensortrace. Changing the output-index order in the Fallback is also trivial.
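
As a sketch of that merging, assuming the label-based tensortrace(A, IA, IC) method from TensorOperations, where the last argument fixes the output index order:

using TensorOperations

A = randn(3, 3, 4, 5)

# partial trace "iijk -> jk" followed by a separate permutation to "kj" ...
B1 = permutedims(tensortrace(A, (:i, :i, :j, :k), (:j, :k)), (2, 1))

# ... collapses into one call by permuting the output-index argument
B2 = tensortrace(A, (:i, :i, :j, :k), (:k, :j))

B1 ≈ B2  # true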

Why ignore outer products?

Under some circumstances it can be beneficial to evaluate Outer products before the end. But this is complicated and adds a lot of operations to consider (any possible outer product, at all times). To get things going, I chose to ignore that at first.

Why types in the cost-calculation?

We could choose to enumerate the types and work with integers instead, to avoid the overhead of creating typed objects. That would lead to much more complicated code, since we would still have to decide which operators to combine or not.

Why combine at all?

Not combining operations can lead to hugely increased costs. E.g. in ijk,ijk -> ijk, where all indices have dimension d, the cost of the combined operation is d^3, whereas if each index is merged one after the other we get the sequence ijk,imn -> ijkmn, ijjmn -> ijmn, ijmm -> ijm, which has a cost of d^5 + d^4 + d^3.
But as mentioned above, we don't want to combine any two operations on the same tensors if it means having to use a slow generic fallback instead of two fast operations.
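
A quick plain-Julia check of this example, collapsing the two diagonal steps into one for brevity (illustrative code, not the package implementation):

d = 4
A, B = randn(d, d, d), randn(d, d, d)

# combined "ijk,ijk -> ijk": a single d^3 loop, i.e. an elementwise product
C1 = A .* B

# sequential: the d^5 intermediate of "ijk,imn -> ijkmn" dominates the cost
T = [A[i, j, k] * B[i, m, n] for i in 1:d, j in 1:d, k in 1:d, m in 1:d, n in 1:d]
C2 = [T[i, j, k, j, k] for i in 1:d, j in 1:d, k in 1:d]

C1 ≈ C2  # true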

Where to go from here?

While the algorithm is terribly slow right now, I'm not sure that creating types is the main culprit. As it stands, a lot of type-unstable tuple functions are called, and little thought has gone into better structuring of the data and/or algorithms.
So there's certainly a lot to improve, but if we decide to do it like numpy.einsum and allow the optimal order to be calculated separately from einsum and passed in as an argument, even a slow algorithm might be useful.
E.g. calculate an order outside a loop, evaluate it inside a loop.
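
For example, a hypothetical sketch of that workflow (optimalorder, update, and the order keyword are assumed names for illustration, not the current API):

order = optimalorder(ixs, iy, size.(xs))    # expensive search, runs once
for step in 1:nsteps
    xs = update(xs, step)                   # hypothetical per-iteration update
    y = einsum(ixs, xs, iy; order = order)  # cheap evaluation in the hot loop
end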

Thoughts, @GiggleLiu?

Thanks for the explanation

Why decide the operators?

The operators have to be fixed to calculate the cost, since combining operations, e.g. evaluating two traces at once, can be cheaper than calculating them one after the other.

Yes.

Calculating the operators allows us to combine certain operations, but we want to control which ones:
if we can combine two optimized operations like TensorContract (which uses BLAS under the hood), we want to do that, but we don't want to combine a TensorContract with an IndexReduction if it means that instead of calling a fast tensorcontract and sum we need the generic fallback.

The only criterion for the optimization should be complexity, simply because that is simpler.
Thanks for the careful thinking, but I believe simplicity matters more here; this is the key point. A simpler design usually implies better maintainability, robustness, and readability, even if the performance is not optimal. So BLAS should not be considered here; it makes the story much more complex.
(We need a thorough discussion about this point today.)

We should also notice that there is a lot of room for optimization in the fallback implementation. Even in the final version, I believe BLAS will only be faster by a factor of approximately 10, which does not compete with the asymptotic behavior given by complexity theory. Note that what we want here is a crudely optimal result (in fact, the truly optimal approach is not achievable).

The optimization taking BLAS into consideration should land in v0.3 or later. It becomes far more complex once GPUs are taken into account, and the fact is that most of our research code will run on GPUs.

Why ignore the permutation at the end?

I assume that pretty much every operation can be merged with Permutation. E.g. a partial trace and a Permutation can be combined by simply permuting the output-index argument in tensortrace. Changing the output-index order in the Fallback is also trivial.

Same argument as above: simplicity matters more. If we take every aspect into consideration, the code will no longer be maintainable. But thanks for the very careful thinking.

Why ignore outer products?

Under some circumstances it can be beneficial to evaluate Outer products before the end. But this is complicated and adds a lot of operations to consider (any possible outer product, at all times). To get things going, I chose to ignore that at first.

Same as above.

Why types in the cost-calculation?

We could choose to enumerate the types and work with integers instead, to avoid the overhead of creating typed objects. That would lead to much more complicated code, since we would still have to decide which operators to combine or not.

Here we see that if we do not consider BLAS, the design is much simpler. This is an aspect of maintainability.

Why combine at all?

Not combining operations can lead to hugely increased costs. E.g. in ijk,ijk -> ijk, where all indices have dimension d, the cost of the combined operation is d^3, whereas if each index is merged one after the other we get the sequence ijk,imn -> ijkmn, ijjmn -> ijmn, ijmm -> ijm, which has a cost of d^5 + d^4 + d^3.
But as mentioned above, we don't want to combine any two operations on the same tensors if it means having to use a slow generic fallback instead of two fast operations.

Agreed that we need to combine operations.

Where to go from here?

While the algorithm is terribly slow right now, I'm not sure that creating types is the main culprit. As it stands, a lot of type-unstable tuple functions are called, and little thought has gone into better structuring of the data and/or algorithms.
So there's certainly a lot to improve, but if we decide to do it like numpy.einsum and allow the optimal order to be calculated separately from einsum and passed in as an argument, even a slow algorithm might be useful.
E.g. calculate an order outside a loop, evaluate it inside a loop.

My opinion is to not consider the contraction order for the moment. It is not worth spending a lot of time finding the optimal contraction order: it requires knowledge of tree-width, and it is not included in the original proposal, although it is a very interesting subject.

If we can dispatch pairwise operations to TensorOperations, and dispatch some simple functions like sum, permute, and trace when they apply directly, we can solve all the bottlenecks in TRG and CTMRG. It is an easy win with 1-2 days of effort.
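
As a sketch of the kind of direct dispatches meant here, using TensorOperations' @tensor macro and plain Julia (illustrative, not the package implementation):

using TensorOperations

A = randn(4, 4, 5)
B = randn(5, 6)

# pairwise contraction "ijk,kl -> ijl": hand off to TensorOperations (BLAS-backed)
@tensor C[i, j, l] := A[i, j, k] * B[k, l]

# partial trace "iij -> j": a direct trace
@tensor t[j] := A[i, i, j]

# pure permutation "ij -> ji" and index reduction "ijk -> i" map to Base functions
P = permutedims(B, (2, 1))
s = dropdims(sum(A; dims = (2, 3)); dims = (2, 3))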

What do you think? @under-Peter

Following your suggestions, I would now just replace optimiseorder with a function that is based on complexity and grouping but doesn't consider the type of operation, and is hopefully simpler.
The rest could remain unchanged for the moment.
OK? @GiggleLiu

Cool, that sounds very good!

(Currently there are some problems with PRs on GitHub, so I'll just link the branch.)

Here is now an alternative for finding the optimal contraction order, based purely on edge contractions.
Once an "optimal" order is found, I use modifyops once to translate the list of edges into a list of operators and group operations where possible.
How and what to group could easily be extended, but testing all possible groupings doesn't seem to be necessary.

While it's just a very first draft, is that what you had in mind? @GiggleLiu

edit:
Note that this is basically what I had here: https://github.com/under-Peter/OMEinsum.jl/blob/aac3e731555b8fb0edc82a6385cb1e03ef8c85e7/src/einorder.jl before I added the ability to combine operators because I found some operations to be too slow.

If we can dispatch pairwise operations to TensorOperations, and dispatch some simple functions like sum, permute, and trace when they apply directly, we can solve all the bottlenecks in TRG and CTMRG. It is an easy win with 1-2 days of effort.

And we can already do that, so if we don't export/advertise einsumopt, we could just go forward and work on TRG or CTMRG, right? Or what changes do you think are required?

That's it. Thanks, it is much easier to read than before.

Still room for improvement:

  1. Do not include modifyops inside optimalorder, and do not include operatorfromedge inside modifyops; these are typical examples of a highly entangled design.
  2. Add unit tests for the optimal contraction order and for the operator-generation functions. Testing through einsumopt is more like a system test; it cannot detect an incorrect contraction order. Good unit tests indicate a well-disentangled design.

Since I don't see the tests, I suspect you have written them as a whole, which can easily give you incorrect results. A probably better way of programming:

  1. Write specialized evaluation functions, and write tests to make sure they work.
  2. Write the operation-type generation functions, and write tests to make sure they generate the correct types.
  3. Write optimalorder, and write tests to make sure it actually finds the optimal order (a sketch follows below).
  4. Group the above operations, and write system tests.
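
For instance, a unit test for the order search alone might look like this (optimalorder and its signature are hypothetical stand-ins, not the actual function):

using Test

@testset "optimal contraction order" begin
    ixs = ((:i, :j), (:j, :k), (:k, :l))
    iy = (:i, :l)
    sizes = Dict(:i => 2, :j => 100, :k => 2, :l => 100)
    order = optimalorder(ixs, iy, sizes)      # hypothetical order-search function
    # contracting tensors 1 and 2 first avoids the large (j, l) intermediate
    @test first(order) == (1, 2)
end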

Especially when collaborating, tests help me understand each building block. So unit tests are about much more than test coverage.