This is a package for writing array operations in index notation, such as:
```julia
@tullio M[x,y,c] := N[x+i, y+j,c] * K[i,j]    # sum over i,j, and create M
@tullio S[x] = P[x,y] * log(Q[x,y] / R[y])    # sum over y, and write into S
@tullio A[i,j] += B[i,k,l] * C[l,j] * D[k,j]  # sum over k,l, and add to values in A
@tullio (*) Z[j] := X[ind[k],j] * exp(-Y[k])  # product over k
```
Used by itself the macro writes ordinary nested loops much like `Einsum.@einsum`. One difference is that it can parse more expressions (such as the convolution `M` above, and worse). Another is that it will use multi-threading (via `Threads.@spawn`), dividing large enough arrays into blocks.
But it also co-operates with various other packages, provided they are loaded before the macro is called:
- It can use `LoopVectorization.@avx` to speed many things up. (Disable with `avx=false`.)
- It can use `KernelAbstractions.@kernel` to make a GPU version. (Disable with `cuda=false`.)
- It can use `TensorOperations.@tensor` on expressions which this understands. (Disable with `tensor=false`.) These must be Einstein-convention contractions of one term; none of the examples above qualify (a qualifying contraction is sketched below).
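As a hedged sketch of that last point (the array sizes are illustrative, not from this README): with `TensorOperations` loaded, a one-term Einstein contraction may be handed over to `@tensor`, while `tensor=false` keeps Tullio's own loops.

```julia
# Sketch: a one-term Einstein contraction, the kind TensorOperations can handle.
# With TensorOperations loaded, Tullio may dispatch this to @tensor;
# tensor=false (a documented option) forces Tullio's own loops instead.
using Tullio, TensorOperations
A = rand(50, 60); B = rand(60, 70)
@tullio C1[i,k] := A[i,j] * B[j,k]                # may use @tensor
@tullio C2[i,k] := A[i,j] * B[j,k] tensor=false   # plain nested loops
C1 ≈ C2  # true
```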
The macro also tries to provide a gradient for use with Tracker or Zygote. (Disable with `grad=false`, or `nograd=A`.) This is done in one of two ways:
- By default it takes a symbolic derivative of the right-hand-side expression. When using `@tensor`, this writes another `@tensor` expression for each input array; otherwise it simply fills in all the gradient arrays at once. (Only for reductions over `+` or `min`/`max`.)
- The option `grad=Dual` instead uses ForwardDiff to differentiate the right-hand side (only for reductions over `+`). This allows for more complicated expressions (a short sketch follows this list).
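For example, a minimal sketch of requesting the two modes (assuming Zygote drives the outer call; the functions and values are illustrative):

```julia
# Sketch: default symbolic gradient vs. grad=Dual (ForwardDiff must be loaded).
using Tullio, Zygote, ForwardDiff
f_symbolic(x) = @tullio s := x[i]^2                     # symbolic derivative of the RHS
f_dual(x)     = @tullio s := abs2(sin(x[i])) grad=Dual  # RHS differentiated via Dual numbers
Zygote.gradient(f_symbolic, [1.0, 2.0, 3.0])  # ([2.0, 4.0, 6.0],)
Zygote.gradient(f_dual, [1.0, 2.0, 3.0])
```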
The expression need not be just one line, for example:
```julia
@tullio out[x,y,n] := begin            # sum over a,b
    i = mod(x+a, axes(mat,1))
    j = mod(y+b, axes(mat,2))
    @inbounds mat[i,j,n] * abs(kern[a,b])
end (x in axes(mat,1), y in axes(mat,2)) grad=Dual
```
Here the macro cannot infer the range of the output's indices `x,y`, so they must be provided explicitly. (If writing into an existing array, with `out[x,y,n] = begin ...` or `+=`, then ranges would be taken from there.) It knows that it should not sum over indices `i,j`, but since it can't be sure of their ranges, it will not add `@inbounds` in such cases. It will also not be able to take a symbolic derivative here, but dual numbers will work fine.
Pipe operators `|>` and `<|` indicate functions to be performed outside the sum, for example:

```julia
@tullio lse[j] := log <| exp(mat[i,j])  # vec(log.(sum(exp.(mat), dims=1)))
```
The option `verbose=true` will cause the macro to print index ranges, symbolic derivatives, and notices when it is unable to use the packages mentioned above.
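For instance (an assumed illustration of this option; the printed output is not reproduced here):

```julia
# Assumed illustration: verbose=true makes the macro report what it inferred.
using Tullio
M = rand(3, 4)
@tullio verbose=true S[i] := M[i,j]   # prints the inferred range of j, etc.
```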
Notation
```julia
using Pkg; pkg"add Tullio"  # now registered

using Tullio
A = [abs2(i - 11) for i in 1:21]

# Downsample -- range of i is that allowed by both terms:
@tullio D[i] := (A[2i] + A[2i+1])/2  # 1:10 == intersect(1:10, 0:10)

# Shifts -- range of i calculated in terms of that given for j:
@tullio M[i,j] := A[i+j-1]  (j in 1:15)  # i in 1:7

using OffsetArrays  # Convolve a filter:
K = OffsetArray([1,-1,2,-1,1], -2:2)
@tullio C[i] := A[i+j] * K[j]  # j ∈ -2:2 implies i ∈ 3:19

# Index by the values in K
@tullio D[i,j] := A[2K[j]+i] ÷ K[j]  # extrema(K)==(-1,2) implies i ∈ 3:17

using FFTW  # Functions of the indices are OK:
S = [0,1,0,0, 0,0,0,0]
fft(S) ≈ @tullio F[k] := S[x] * exp(-im*pi/8 * (k-1) * x)  (k ∈ axes(S,1))

# Finalisers <| or |> are applied after sum:
@tullio N2[j] := sqrt <| M[i,j]^2  # N2 ≈ map(norm, eachcol(M))
@tullio n3 := A[i]^3 |> (_)^(1/3)  # n3 ≈ norm(A,3), with _ anon. func.

# Reduction over any function:
@tullio (*) P[i] := A[i+k]  (k in 0:2)   # product
@tullio (max) X[i,_] := D[i,j]           # maximum(D, dims=2), almost

# Access to fields & arrays -- this uses j ∈ eachindex(first(N).c)
N = [(a=i, b=i^2, c=fill(i^3,3)) for i in 1:10]
@tullio T[i,j] := (N[i].a // 1, N[i].c[j])

# Functions which create arrays are evaluated once:
@tullio R[i,j] := abs.((rand(Int8, 5)[i], rand(Int8, 5)[j]))

using NamedDims, AxisKeys  # Dimension names, plus pretty printing:
@tullio M[row=i, col=j, z=k] := A[i+j-1]  (j in 1:15, k in 1:2)
@tullio S[i] := M[col=j-i, z=k, row=i+1]  # sum over j,k
```
Threads & SIMD
```julia
using Tullio, LoopVectorization, NNlib, BenchmarkTools

# Batched matmul with batch index first in B, defined with @avx loops:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j]  # (sum over j)

A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1)))  # true

@btime bmm_rev($A, $B);  # 317.526 μs, same speed as un-permuted bmm
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1)));  # 1.478 ms, with MKL

# Complete reduction, without first materialising X .* log.(Y')
sum_opp(X, Y=X) = @tullio s := X[i,j] * log(Y[j,i])

X = rand(1000,1000);
@btime sum_opp($X)  # 499.814 μs (173 allocations: 14.20 KiB)
@btime sum($X .* log.(transpose($X)))  # 8.759 ms (2 allocations: 7.63 MiB)
```
Derivatives & GPU
```julia
using Tullio
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

A = rand(3,40); B = rand(40,500);
A * B ≈ mul(A, B)  # true

using Tracker  # or Zygote
ΔA = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
ΔA ≈ ones(3,500) * B'  # true

using CUDA, KernelAbstractions  # Now defined with a GPU version:
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

cu(A * B) ≈ mul(cu(A), cu(B))  # true
cu(ΔA) ≈ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1]  # true

# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]
```
Larger expressions
```julia
mat = zeros(10,10,1); mat[1,1] = 101;
@tullio kern[i,j] := 1/(1+i^2+j^2)  (i in -2:2, j in -2:2)

@tullio out[x,y,c] := begin
    xi = mod(x+i, axes(mat,1))  # xi = ... means that it won't be summed,
    yj = mod(y+j, axes(mat,2))
    @inbounds trunc(Int, mat[xi, yj, c] * kern[i,j])  # and disables automatic @inbounds,
end (x in 1:10, y in 1:10)  # and prevents range of x from being inferred.

fs = [sin, cos, tan]
xs = randn(3,100)
@tullio ys[r,c] := (fs[r])(xs[r,c])  # works, but not the gradient

function rowmap(fs, xs)
    axes(fs,1) == axes(xs,1) || error()
    @tullio ys[r,c] := begin
        @inbounds f = getindex($fs, r)  # not writing fs[r] avoids trying to make Δfs
        @inbounds f(xs[r,c])  # assignment f = ... has again disabled @inbounds.
    end grad=Dual
end

using Zygote, ForwardDiff
Zygote.gradient(sum∘rowmap, fs, ones(3,2))
[f'(1) for f in fs]
```
Options
The default setting is:

```julia
@tullio threads=true fastmath=true avx=true tensor=true cuda=256 grad=Base verbose=false A[i,j] := ...
```
- `threads=false` turns off threading, while `threads=64^3` sets a threshold size at which to divide the work (replacing the macro's best guess).
- `avx=false` turns off the use of `LoopVectorization`, while `avx=4` inserts `@avx unroll=4 for i in ...`.
- `grad=false` turns off gradient calculation, and `grad=Dual` switches it to use `ForwardDiff` (which must be loaded).
- `nograd=A` turns off the gradient calculation just for `A`, and `nograd=(A,B,C)` does this for several arrays.
- `tensor=false` turns off the use of `TensorOperations`.
- Assignment `xi = ...` removes `xi` from the list of indices: its range is not calculated, and it will not be summed over. It also disables `@inbounds`, since this is now up to you.
- `verbose=true` prints things like the index ranges inferred, and gradient calculations. `verbose=2` prints absolutely everything.
- `A[i,j] := ...` makes a new array, while `A[i,j] = ...` and `A[i,j] += ...` write into an existing one. `A[row=i, col=j] := ...` makes a new `NamedDimsArray`.
- `@tullio (*) A[i,j] := ...` is a product, as is `@tullio A[i,j] *= ...`. (Several of these options are combined in the sketch below.)
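A hedged sketch combining several of these documented options on one call (the particular values are illustrative):

```julia
# Sketch: combining documented options. avx=4 needs LoopVectorization loaded.
using Tullio, LoopVectorization
A = rand(100, 100); B = rand(100, 100)
@tullio C[i,k] := A[i,j] * B[j,k]  threads=64^3  avx=4  grad=false  verbose=true
```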
Implicit:
- Indices without shifts must have the same range everywhere they appear, but those with shifts (even `A[i+0]`) run over the intersection of possible ranges.
- Shifted output indices must start at 1, unless `OffsetArrays` is visible in the calling module (a sketch follows this list).
- The use of `@avx`, and the calculation of gradients, are switched off by sufficiently complex syntax (such as arrays of arrays).
- Gradient hooks are attached for any or all of `ReverseDiff`, `Tracker` & `Zygote`. These packages need not be loaded when the macro is run.
- Gradients are only defined for reductions over `(+)` (default) and `min`, `max`.
- GPU kernels are only constructed when both `KernelAbstractions` and `CuArray` are visible. The default `cuda=256` is passed to `kernel(CUDA(), 256)`.
- The CPU kernels from `KernelAbstractions` are called only when `threads=false`; they are not at present very fast, but perhaps useful for testing.
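A sketch of the shifted-output rule above, assuming the behaviour described there (array contents are only for illustration):

```julia
# Sketch: a shifted output index. Without OffsetArrays this would be an error,
# since i would have to start at 1; with it loaded, the result is an OffsetArray.
using Tullio, OffsetArrays
A = [abs2(i - 11) for i in 1:21]
@tullio O[i] := A[i+10]   # i runs over -9:11, since i+10 must lie in 1:21
axes(O, 1)                # -9:11
```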
Extras:
- `A[i] := i^2 (i in 1:10)` is how you specify a range for indices when this can't be inferred.
- `A[i] := B[i, $col] - C[i, 2]` is how you fix one index to a constant (to prevent `col` being summed over).
- `A[i] := $d * B[i]` is the preferred way to include other constants. Note that no gradient is calculated for `d`. (Both `$` forms are sketched below.)
- `Tullio.@printgrad (x+y)*log(x/z) x y z` prints out how symbolic derivatives will be done.
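A short sketch of the `$`-interpolation rules above (the arrays and values are illustrative):

```julia
# Sketch: $col and $d are interpolated constants, not summation indices.
using Tullio
B = rand(4, 5); C = rand(4, 5); b = rand(4)
col = 3; d = 2.5
@tullio A1[i] := B[i,$col] - C[i,2]   # col is fixed, not summed over
@tullio A2[i] := $d * b[i]            # no gradient is calculated for d
```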
Elsewhere
Back-end friends & relatives:
- LoopVectorization.jl is used here, if available.
- Gaius.jl and PaddedMatrices.jl build on that.
- GPUifyLoops.jl and KernelAbstractions.jl generate GPU-compatible kernels.
- ThreadsX.jl does threaded reductions, and much else.
- Strided.jl does multi-threaded broadcasting.
Front-end near-lookalikes:
- Einsum.jl makes simple loops. See tests/einsum.jl, where `using Tullio: @einsum` is an almost-seamless replacement.
- TensorOperations.jl and OMEinsum.jl identify patterns on which they can call various basic operations.
- TensorCast.jl expresses everything as Julia array operations, broadcasting and reduction. (OMEinsum.jl also treats some cases as a special lazy broadcast-reduction.)
Things you can't run:
- Tortilla.jl seems to exist, publicly, only in this very nice talk.
- ArrayMeta.jl was a Julia 0.5 take on some of this.
- Tokamak.jl was another, see readme here.