
High-Performance Tensor-Vector Multiplication Library (TTV)


Summary

TTV is a header-only C++ library for high-performance tensor-vector multiplication. It provides free C++ functions that compute, in parallel, the mode-q tensor-times-vector product of the general form

$$ \underline{\mathbf{C}} = \underline{\mathbf{A}} \times_q \mathbf{b} \quad :\Leftrightarrow \quad \underline{\mathbf{C}} (i_1, \dots, i_{q-1}, i_{q+1}, \dots, i_p) = \sum_{i_q=1}^{n_q} \underline{\mathbf{A}}({i_1, \dots, i_q, \dots, i_p}) \cdot \mathbf{b}({i_q}). $$

where $q$ is the contraction mode, $\underline{\mathbf{A}}$ and $\underline{\mathbf{C}}$ are tensors of order $p$ and $p-1$ with shapes $\mathbf{n}_a = (n_1,\dots,n_{q-1},n_q,n_{q+1},\dots,n_p)$ and $\mathbf{n}_c = (n_1,\dots,n_{q-1},n_{q+1},\dots,n_p)$, respectively, and $\mathbf{b}$ is a vector of length $n_q$.
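
For example, for an order-3 tensor contracted along mode $q=2$ (the case used in the code example below), the definition reads

$$ \underline{\mathbf{C}}(i_1, i_3) = \sum_{i_2=1}^{n_2} \underline{\mathbf{A}}(i_1, i_2, i_3) \cdot \mathbf{b}(i_2), $$

so that $\underline{\mathbf{C}}$ is an $n_1 \times n_3$ matrix.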

All function implementations are based on the Loops-Over-GEMM (LOG) approach and utilize high-performance GEMV or DOT routines of a BLAS implementation such as OpenBLAS or Intel MKL. Implementation details and the runtime behavior of the tensor-vector multiplication functions are described in the research article cited below.
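
To give an intuition for the LOG approach, consider the simplest case: for $q = 1$ and the first-order (column-major) storage format, the mode-1 product is a single GEMV on the $n_1 \times (n_2 \cdots n_p)$ unfolding of the tensor, i.e. $\mathbf{c} = \mathbf{A}^T \mathbf{b}$. The sketch below is ours and not the library's implementation; the helper ttv_mode1 is hypothetical, while cblas_sgemv is the standard CBLAS routine shipped with, e.g., OpenBLAS.

#include <cblas.h>    // CBLAS interface, e.g. from OpenBLAS
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Mode-1 product for the first-order (column-major) layout:
// C(i2,...,ip) = sum_{i1} A(i1,...,ip) * b(i1), i.e. one GEMV
// on the n1 x (n2*...*np) unfolding of A.
std::vector<float> ttv_mode1(std::vector<float> const& a,        // tensor elements
                             std::vector<std::size_t> const& n,  // shape (n1,...,np)
                             std::vector<float> const& b)        // vector of length n1
{
  std::size_t n1   = n.front();
  std::size_t nbar = std::accumulate(n.begin()+1, n.end(),
                                     std::size_t(1), std::multiplies<>{});
  auto c = std::vector<float>(nbar, 0.0f);
  // c := 1*A^T*b + 0*c with A column-major, n1 rows and nbar columns
  cblas_sgemv(CblasColMajor, CblasTrans, (int)n1, (int)nbar,
              1.0f, a.data(), (int)n1, b.data(), 1, 0.0f, c.data(), 1);
  return c;
}

For q > 1 the same idea is applied slice-wise: loops iterate over tensor slices and call GEMV or DOT on each slice, which is what the name Loops-Over-GEMM refers to.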

Please have a look at the wiki page for more information about the usage, the function interfaces and the parameter settings.

Key Features

Flexibility

  • Contraction mode q, tensor order p, tensor extents n and tensor layout pi can be chosen at runtime
  • Supports any non-hierarchical storage format including the first-order and last-order storage layouts
  • Offers two high-level interfaces and one C-like low-level interface for calling the tensor-times-vector multiplication
  • Implemented independently of any tensor data structure (can be used with std::vector and std::array)
  • Supports float, double, complex and double complex data types (and more if a BLAS library is not used)

Performance

  • Multi-threading support with OpenMP
  • Can be used with and without a BLAS implementation
  • Performs in-place operations without transposing the tensor - no extra memory needed
  • Reaches peak matrix-times-vector performance for large tensors

Requirements

  • Requires the tensor elements to be contiguously stored in memory.
  • Element types must be arithmetic types supporting the multiplication and addition operators (a minimal reference sketch illustrating both requirements follows below)
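
The following is a minimal, purely illustrative reference loop for the mode-q product in the first-order (column-major) storage format. The function ttv_naive and its interface are our own sketch, not part of the library, and rely only on the two requirements above: contiguous storage and the + and * operators of the element type.

#include <cstddef>
#include <vector>

// Naive mode-q tensor-times-vector for the first-order (column-major)
// storage format; illustrative only. tlib's functions additionally
// parallelize, slice and dispatch to BLAS.
template<class T>
std::vector<T> ttv_naive(std::size_t q,                      // 1-based contraction mode
                         std::vector<T> const& a,            // tensor elements, contiguous
                         std::vector<std::size_t> const& n,  // shape (n1,...,np)
                         std::vector<T> const& b)            // vector of length n[q-1]
{
  std::size_t w = 1, v = 1, p = n.size();     // w = n1*...*n(q-1), v = n(q+1)*...*np
  for(std::size_t k = 0; k+1 < q; ++k) w *= n[k];
  for(std::size_t k = q; k < p;   ++k) v *= n[k];
  std::size_t nq = n[q-1];

  auto c = std::vector<T>(w*v, T{});
  for(std::size_t f = 0; f < v;  ++f)         // modes to the right of q
    for(std::size_t i = 0; i < nq; ++i)       // contracted mode q
      for(std::size_t m = 0; m < w;  ++m)     // modes to the left of q
        c[m + f*w] += a[m + i*w + f*w*nq] * b[i];   // column-major strides
  return c;
}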

Experiments

The experiments were carried out on an Intel Core i9-7900X processor with 10 cores and 20 hardware threads running at 3.3 GHz. The source code was compiled with GCC v7.3 using the highest optimization level -Ofast together with -march=native, -pthread and -fopenmp. Parallel execution was accomplished with GCC's implementation of the OpenMP v4.5 specification. We used the dot and gemv implementations of the OpenBLAS library v0.2.20. The benchmark result of each of the following functions is the average of 10 runs.

The comparison includes three state-of-the-art libraries that implement three different approaches.

  • TCL (v0.1.1) implements the TTGT approach.
  • TBLIS (v1.0.0) implements the GETT approach.
  • EIGEN (v3.3.90) executes the tensor-times-vector product sequentially and in-place.

The experiments have been carried out with asymmetrically-shaped and symmetrically-shaped tensors in order to provide comprehensive test coverage, where the tensor elements are stored in the first-order storage format. The order of the asymmetrically- and symmetrically-shaped tensors has been varied from 2 to 10 and from 2 to 7, respectively. The contraction mode q has also been varied from 1 to the tensor order p.

Symmetrically-Shaped Tensors

TTV has been executed with the parameters tlib::execution::blas, tlib::slicing::large and tlib::loop_fusion::all.


Asymmetrically-Shaped Tensors

TTV has been executed with the parameters tlib::execution::blas, tlib::slicing::small and tlib::loop_fusion::all.
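
For illustration only, such a call could look as follows; the overload shown here, passing the policy tags to a free ttv function, is an assumption based on the wiki, not the verified interface.

  // hypothetical invocation sketch; see the wiki for the actual interface
  auto C = tlib::ttv(q, A, B,
                     tlib::execution::blas,   // execute with BLAS routines (GEMV/DOT)
                     tlib::slicing::small,    // slicing policy used in this experiment
                     tlib::loop_fusion::all); // fuse all fusible loops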


Example

/*main.cpp*/
#include <vector>
#include <numeric>    // std::iota
#include <algorithm>  // std::fill
#include <iostream>
#include <tlib/ttv.h>


int main()
{
  const auto q = 2ul; // contraction mode
  
  auto A = tlib::tensor<float>( {4,3,2} ); 
  auto B = tlib::tensor<float>( {3,1}   );
  std::iota(A.begin(),A.end(),1);
  std::fill(B.begin(),B.end(),1);

/*
  A =  { 1  5  9  | 13 17 21
         2  6 10  | 14 18 22
         3  7 11  | 15 19 23
         4  8 12  | 16 20 24 };

  B =   { 1 1 1 } ;
*/

  // computes mode-2 tensor-times-vector product with C(i,j) = A(i,k,j) * B(k)
  auto C1 = A(q) * B;
  
/*
  C1 = { 1+5+ 9 | 13+17+21
         2+6+10 | 14+18+22
         3+7+11 | 15+19+23
         4+8+12 | 16+20+24 };
*/
}

Compile with g++ -I../include/ -std=c++17 -Ofast -fopenmp main.cpp -o main and additionally define -DUSE_OPENBLAS or -DUSE_INTELBLAS for fast execution with the corresponding BLAS library.

Citation

If you want to refer to TTV as part of a research paper, please cite the article Design of a High-Performance Tensor-Vector Multiplication with BLAS:

@inproceedings{ttv:bassoy:2019,
  author="Bassoy, Cem",
  editor="Rodrigues, Jo{\~a}o M. F. and Cardoso, Pedro J. S. and Monteiro, J{\^a}nio and Lam, Roberto and Krzhizhanovskaya, Valeria V. and Lees, Michael H. and Dongarra, Jack J. and Sloot, Peter M.A.",
  title="Design of a High-Performance Tensor-Vector Multiplication with BLAS",
  booktitle="Computational Science -- ICCS 2019",
  year="2019",
  publisher="Springer International Publishing",
  address="Cham",
  pages="32--45",
  isbn="978-3-030-22734-0"
}