Fastor is a stack-based high performance tensor (multi-dimensional array) library written in modern C++ [C++11/14/17] with powerful in-built tensor algebraic functionalities (tensor contraction, permutation, reductions, special tensor groups etc). There are multiple paradigms that Fastor exploits:
- Operation minimisation or low FLOP/complexity reducing algorithms: Fastor relies on a domain-aware Expression Template (ET) engine that can not only perform lazy and delayed evaluation but also sophisticated mathematical transformation at compile time such as graph optimisation, nearly symbolic tensor algebraic manipulation to reduce the complexity of evaluation of BLAS and/or non-BLAS type expressions by orders of magnitude. Some of these functionalities are non-existent in other available C++ ET linear algebra libraries.
- SIMD/Data parallelism/Stream computing Fastor utilises explicit SIMD instructions (from SSE all the way to AVX512 and FMA).
- Zero overhead tensor algebraic functions Fastor incorporates statically dispatched bespoke kernels for a variety of tensor products using a priori knowledge of tensors either through compile-time specialisation or advanced topological studies or both.
Full documenation can be found under the Wiki pages.
Fastor provides a high level interface for tensor algebra. As a first example consider the following
Tensor<double> scalar; // A scalar
Tensor<double,6> vector6; // A vector
Tensor<double,4,5> matrix; // A second order tensor
Tensor<double,3,3,3> tensor_3; // A third order tensor with dimension 3x3x3
tensor_3.arange(0); // fill tensor with sequentially ascending numbers
print(tensor_3); // print out the tensor
tensor_3(0,2,1); // index a tensor
tensor_3(all,last,seq(0,2)); // slice a tensor
tensor_3.rank(); // get rank of tensor, 3 in this case
Tensor<float,2,2,2,2,1,2,2,4,3,2,3,3,6> tensor_13; // A 13th order tensor
will output the following
[0,:,:]
⎡ 0, 1, 2 ⎤
⎢ 3, 4, 5 ⎥
⎣ 6, 7, 8 ⎦
[1,:,:]
⎡ 9, 10, 11 ⎤
⎢ 12, 13, 14 ⎥
⎣ 15, 16, 17 ⎦
[2,:,:]
⎡ 18, 19, 20 ⎤
⎢ 21, 22, 23 ⎥
⎣ 24, 25, 26 ⎦
Einstein summation as well as summing over multiple (i.e. more than two) indices are supported. As a complete example, for instance, consider
#include <Fastor/Fastor.h>
using namespace Fastor;
enum {I,J,K,L,M,N};
int main() {
// An example of Einstein summation
Tensor<double,2,3,5> A; Tensor<double,3,5,2,4> B;
// fill A and B
A.random(); B.random();
auto C = einsum<Index<I,J,K>,Index<J,L,M,N>>(A,B);
// An example of summing over three indices
Tensor<double,5,5,5> D; D.random();
auto E = reduction(D);
// An example of tensor permutation
Tensor<float,3,4,5,2> F; F.random();
auto G = permutation<Index<J,K,I,L>>(F);
// Output the results
print("Our big tensors:",C,E,G);
return 0;
}
You can compile and run this by providing the following (or equivalent) flags to your compiler -std=c++14 -O3 -march=native -DNDEBUG
.
Fastor introduces powerful tensor views which make tensor indexing, slicing and broadcating look and feel native to scientific programmers. Consider the following examples
Tensor<double,4,3,10> A, B;
A.random(); B.random();
Tensor<double,2,2,5> C; Tensor<double,4,3,1> D;
// Dynamic views -> seq(first,last,step)
C = A(seq(0,2),seq(0,2),seq(0,last,2)); // C = A[0:2,0:2,0::2]
D = B(all,all,0) + A(all,all,last); // D = B[:,:,0] + A[:,:,-1]
A(2,all,3) = 5.0; // A[2,:,3] = 5.0
// Static views -> fseq<first,last,step>
C = A(fseq<0,2>(),fseq<0,2>(),fseq<0,last,2>()); // C = A[0:2,0:2,0::2]
D = B(fall,fall,fseq<0,1>()) + A(fall,fall,fseq<9,10>()); // D = B[:,:,0] + A[:,:,-1]
A(2,fall,3) = 5.0; // A[2,:,3] = 5.0
// Overlapping is also allowed without having undefined behaviour
A(seq(2,last),all,all).noalias() += A(seq(0,last-2),all,all); // A[2::,:,:] += A[::-2,:,:]
// Note that in case of perfect overlapping noalias is not required
A(seq(0,last-2),all,all) += A(seq(0,last-2),all,all); // A[::2,:,:] += A[::2,:,:]
// If instead of a tensor view, one needs an actual tensor the iseq could be used
// iseq<first,last,step>
C = A(iseq<0,2>(),iseq<0,2>(),iseq<0,last,2>()); // C = A[0:2,0:2,0::2]
// Note that iseq returns an immediate tensor rather than a tensor view and hence cannot appear
// on the left hand side, for instance
A(iseq<0,2>(),iseq<0,2>(),iseq<0,last,2>()) = 2; // Will not compile, as left operand is an rvalue
// One can also index a tensor with another tensor(s)
Tensor<float,10,10> E; E.fill(2);
Tensor<int,5> it = {0,1,3,6,8};
Tensor<size_t,10,10> t_it; t_it.arange();
E(it,0) = 2;
E(it,seq(0,last,3)) /= -1000.;
E(all,it) += E(all,it) * 15.;
E(t_it) -= 42 + E;
Aside from iseq
(which pretty much immediately returns another tensor), all other possible combination of slicing and broadcasting types are possible. For instance, one complex slicing and broadcasting example is given below
A(all,all) -= log(B(all,all,0)) + abs(B(all,fall,1)) + sin(C(all,0,all,0)) - 102. - cos(B(all,all,0));
From a performance point of view, Fastor tries very hard to vectorise (read SIMD vectorisation) tensor views, but this heavily depends on the compilers ability to inline multiple recursive functions [as is the case for all expression templates]. If a view appears on the right hand side of an assignment, but not on the left, Fastor automatically vectorises the expression. However if a view appears on the left hand side of an assignment, Fastor does not by default vectorise the expression. To enable vectorisation across all tensor views use the compiler flag -DFASTOR_USE_VECTORISE_EXPR_ASSIGN
. Also for performance reasons it is beneficial to avoid overlapping assignments, otherwise a copy will be made. If your code does not use any overlapping assignments, then this feature can be turned off completely by issusing -DFASTOR_NO_ALIAS
. At this stage it is also beneficial to consider that while compiling complex and big expressions the inlining limit of the compiler should be increased and tested i.e. -finline-limit=<big number>
for GCC, -mllvm -inline-threshold=<big number>
for Clang and -inline-forceinline
for ICC.
To see how efficient tensor views can be vectorised, as an example consider the following 4th order finite difference example for Laplace equation
Tensor<double,100,100> u, v;
// fill u and v
// A complex assignment expression involving multiple tensor views
u(seq(1,last-1),seq(1,last-1)) =
(( v(seq(0,last-2),seq(1,last-1)) + v(seq(2,last),seq(1,last-1)) +
v(seq(1,last-1),seq(0,last-2)) + v(seq(1,last-1),seq(2,last)) )*4.0 +
v(seq(0,last-2),seq(0,last-2)) + v(seq(0,last-2),seq(2,last)) +
v(seq(2,last),seq(0,last-2)) + v(seq(2,last),seq(2,last)) ) / 20.0;
using GCC 6.2
with -O3 -mavx2 -mfma -finline-limit=100000 -ffp-contract=fast -DNDEBUG -DFASTOR_NO_ALIAS -DFASTOR_USE_VECTORISE_EXPR_ASSIGN
the above expression compiles to
L129:
leaq -768(%rcx), %rdx
movq %rsi, %rax
.align 4,0x90
L128:
vmovupd 8(%rax), %ymm0
vmovupd (%rax), %ymm1
addq $32, %rdx
addq $32, %rax
vaddpd 1576(%rax), %ymm0, %ymm0
vaddpd 768(%rax), %ymm0, %ymm0
vaddpd 784(%rax), %ymm0, %ymm0
vfmadd132pd %ymm3, %ymm1, %ymm0
vaddpd -16(%rax), %ymm0, %ymm0
vaddpd 1568(%rax), %ymm0, %ymm0
vaddpd 1584(%rax), %ymm0, %ymm0
vdivpd %ymm2, %ymm0, %ymm0
vmovupd %ymm0, -32(%rdx)
cmpq %rdx, %rcx
jne L128
vmovupd 2376(%rsi), %xmm0
vaddpd 776(%rsi), %xmm0, %xmm0
addq $800, %rcx
addq $800, %rsi
vaddpd 768(%rsi), %xmm0, %xmm0
vaddpd 784(%rsi), %xmm0, %xmm0
vfmadd213pd -32(%rsi), %xmm5, %xmm0
vaddpd -16(%rsi), %xmm0, %xmm0
vaddpd 1568(%rsi), %xmm0, %xmm0
vaddpd 1584(%rsi), %xmm0, %xmm0
vdivpd %xmm4, %xmm0, %xmm0
vmovups %xmm0, -800(%rcx)
cmpq %r13, %rcx
jne L129
Aside from unaligned load and store instructions which are unavoidable the rest of the generated code is as efficient as it gets for an AVX2
architecture. On recent X86 architectures unaligned access as fast as aligned access anyway. This is specifically true for Fastor's stack allocated tensors as the data would potentially fit in L1 cache. With the help of an optimising compiler, Fastor's functionalities come closest to the ideal metal performance for numerical tensor algebra code.
A set of specialised tensors are available that provide optimised tensor algebraic computations, for instance SingleValueTensor
. Some of the computations performed on these tensors have almost zero cost no matter how big the tensor is. These tensors work in the exact same way as the Tensor
class and can be assigned to one another. Consider for example the einsum between two SingleValueTensor
s. A SingleValueTensor
is a tensor of any dimension and size whose elements are all the same (a matrix of ones for instance).
SingleValueTensor<double,20,20,30> a(3.51);
SingleValueTensor<double,20,30> b(2.76);
auto c = einsum<Index<0,1,2>,Index<0,2>>(a,b);
This will incur almost no runtime cost. As where if the tensors were of type Tensor
then a heavy computation would ensue.
A must have feature of every numerical linear algebra and even more so tensor contraction frameworks is lazy evaluation of arbitrary chained operations. Consider the following expression
Tensor<float,16,16,16,16> tn1 ,tn2, tn3, tn4;
tn1.random(); tn2.random(); tn3.random();
tn4 = 2*tn1+sqrt(tn2-tn3);
The above code is transparently converted to a single AVX
loop
for (size_t i=0; i<tn4.Size; i+=tn4.Stride)
_mm256_store_ps(tn4._data+i,_mm256_set1_ps(static_cast<float>(2))*_mm256_load_ps(tn1._data+i)+
_mm256_sqrt_ps(_mm256_sub_ps(_mm256_load_ps(tn2._data+i),_mm256_load_ps(tn3._data+i)));
avoiding any need for temporary memory allocation. As a DSL, Fastor has a much deeper understanding of the domain. By employing template metaprogrommaing techniques it ventures into the realm of smart expression templates to mathematically transform expression and/or apply compile time graph optimisation to find optimal contraction indices of complex tensor networks. As an example, the trace(matmul(transpose(A),B))
which is O(n^3)
in computational complexity is determined to be inefficient and Fastor statically dispatches the call to an equivalent but much more efficient routine, in this case A_ijB_ij
or doublecontract(A,B)
which is O(n^2)
. Further examples of such mathemtical transformation include (not inclusive)
// the l in-front of the names stands for 'lazy'
ldeterminant(linverse(A)); // transformed to 1/ldeterminant(A), O(n^3) reduction in computation
ltranspose(lcofactor(A)); // transformed to ladjoint(A), O(n^2) reduction in memory access
ltranspose(ladjoint(A)); // transformed to lcofactor(A), O(n^2) reduction in memory access
lmatmul(lmatmul(A,B),b); // transformed to lmatmul(A,lmatmul(B,b)), O(n) reduction in computation
// and many more
Note that there are situations that the user may write a complex chain of operations in the most verbose way, perhaps for readibility purposes, but Fastor delays the evaluation of the expression and checks if an equivalent but efficient expression can be computed. The computed expression always binds back to the base tensor, overhead free without a runtime (virtual table/pointer) penalty.
For tensor networks comprising of many higher rank tensors, a full generalisation of the above mathematical transformation can be performed through a constructive graph search optimisation. This typically involves finding the most optimal pattern of tensor contraction by studying the indices of contraction wherein tensor pairs are multiplied, summed over and factorised out in all possible combinations in order to come up with a cost model. Once again, knowing the dimensions of the tensor and the contraction pattern, Fastor performs this operation minimisation step at compile time and further checks the SIMD vectorisability of the tensor contraction loop nest (i.e. full/partial/broadcast vectorisation). In nutshell, it not only minimises the the number of floating point operations but also generates the most optimum vectorisable loop nest for computing those FLOPs. The following figures show the run time benefit of operation minimisation (FLOP optimal) over a single expression evaluation (Memory-saving) approach in contracting a three-tensor-network fitting in L1
, L2
and L3
caches, respectively
In nonlinear mechanics, it is customary to transform high order tensors to low rank tensors using Voigt transformation. Fastor has domain-specific features for such tensorial operations. For example, consider the dyadic product A_ik*B_jl
, that can be computed in Fastor like
Tensor<double,3,3> A,B;
A.random(); B.random();
Tensor<double,6,6> C = einsum<Index<0,2>,Index<1,3>,FASTOR_Voigt>(A,B);
// or alternatively
enum {I,J,K,L};
Tensor<double,6,6> D = einsum<Index<I,K>,Index<J,L>,FASTOR_Voigt>(A,B);
As you notice, all indices are resolved and the Voigt transformation is performed at compile time, keeping only the cost of computation at runtime. Equivalent implementation of this in C/Fortran requires either low-level for loop style programming that has an O(n^4) computational complexity and non-contiguous memory access, or if a function like einsum is desired the indices will need to be passed requiring potentially extra register allocation. Here is performance benchmark between Ctran (C/Fortran) for loop code and the equivalent Fastor implementation for the above example, run over a million times (both compiled using -O3 -mavx
, on Intel(R) Xeon(R) CPU E5-2650 v2 @2.60GHz
running Ubuntu 14.04
):
Notice that by compiling with the same flags, it is meant that the compiler is permitted to auto-vectorise the C/tran code as well. The real performance of Fastor comes from the fact, that when a Voigt transformation is requested, Fastor does not compute the elements which are not needed.
Building upon its domain specific features, Fastor implements the tensor cross product family of algebra recently introduced by Bonet et. al. in the context of numerical analysis of nonlinear classical mechanics which can significantly reduce the amount algebra involved in tensor derivatives of functionals which are forbiddingly complex to derive using a standard approach. The tensor cross product of two second order tensors is defined as C_iI = e_ijk*e_IJK*A_jJ*b_kK
where e
is the third order permutation tensor. As can be seen this product is O(n^6) in computational complexity (furthermore a cross product is essentially defined in 3-dimensional space i.e. perfectly suitable for stack allocation). Using Fastor the equivalent code is only 81 SSE intrinsics
// A and B are second order tensors
using Fastor::LeviCivita_pd;
Tensor<double,3,3> E = einsum<Index<i,j,k>,Index<I,J,K>,Index<j,J>,Index<k,K>>
(LeviCivita_pd,LeviCivita_pd,A,B);
// or simply
Tensor<double,3,3> F = cross(A,B);
Here is performance benchmark between Ctran (C/Fortran) code and the equivalent Fastor implementation for the above example, run over a million times (both compiled using -O3 -mavx
, on Intel(R) Xeon(R) CPU E5-2650 v2 @2.60GHz
running Ubuntu 14.04
):
Notice the almost two orders of magnitude performance gain using Fastor. Again the real performance gain comes from the fact that Fastor eliminates zeros from the computation.
A set of boolean tensor routines are available in Fastor. Note that, whenever possible most of these operations are performed at compile time
is_uniform(); // does the tensor span equally in all spatial dimensions, generalisation of square matrices
is_orthogonal();
does_belong_to_sl3(); // does the tensor belong to special linear 3D group
does_belong_to_so3(); // does the tensor belong to special orthogonal 3D group
is_symmetric(int axis_1, int axis_2); // is the tensor symmetric in the axis_1 x axis_2 plane
is_equal(B); // equality check with another tensor
is_identity();
All basic numerical linear algebra subroutines for small tensors (where the overhead of calling vendor/optimised BLAS
is typically not worth it) are fully SIMD optimised and efficiently implemented
Tensor<double,3,3> A,B;
// fill A and B
auto ab = matmul(A,B); // matrix matrix multiplication of A*B
auto a_norm = norm(A); // Frobenious norm of A
auto b_det = determinant(B); // determinant of B
auto a_inv = inverse(A); // inverse of A
auto b_cof = cofactor(B); // cofactor of B
Fastor utilises a bunch of meta-functions to perform most operations at compile time, consider the following examples
Tensor<double,3,4,5> A;
Tensor<double,5,3,4> B;
Tensor<double,3,3,3> C;
auto D = permutation<Index<2,0,1>>(A); // type of D is deduced at compile time as Tensor<double,5,3,4>
auto E = einsum<Index<I,J,K>,Index<L,M,N>>(D,B); // type of E is deduced at compile time as Tensor<double,5,3,4,5,3,4>
auto F = einsum<Index<I,I,J>>(C); // type of F is deduced at compile time as Tensor<double,3>
auto F2 = reduction(C); // type of F2 is deduced at compile time as scalar i.e. Tensor<double>
auto E2 = reduction(D,B); // type of E2 is deduced at compile time as Tensor<double>
Tensor<float,2,2> G,H;
trace(H); // trace of H, in other words H_II
reduction(G,H); // double contraction of G and H i.e. G_IJ*H_IJ
As you can observe with combination of permutation
, contraction
, reduction
and einsum
(which itself is a glorified wrapper over the first three) any type of tensor contraction, and permutation is possible, and using meta-programming the right amount of stack memory to be allocated is deduced at compile time.
Fastor is extremely light weight, it is a header-only library, requires no build or compilation process and has no external dependencies. It is written in pure C++11 from the foundation.
Fastor has been tested against the following compilers (on Ubuntu 14.04/16.04/18.04 and macOS). While compiling on macOS with Clang, -std=c++14
is necessary
- GCC 4.8, GCC 4.9, GCC 5.1, GCC 5.2, GCC 5.3, GCC 5.4, GCC 6.2, GCC 7.3, GCC 8, GCC 9.1
- Clang 3.6, Clang 3.7, Clang 3.8, Clang 3.9, Clang 5, Clang 7, Clang 8
- Intel 16.0.1, Intel 16.0.2, Intel 16.0.3, Intel 17.0.1, Intel 18.2, Intel 19.2
Fastor can be cited as
@Article{Poya2017,
author="Poya, Roman and Gil, Antonio J. and Ortigosa, Rogelio",
title = "A high performance data parallel tensor contraction framework: Application to coupled electro-mechanics",
journal = "Computer Physics Communications",
year="2017",
doi = "http://dx.doi.org/10.1016/j.cpc.2017.02.016",
url = "http://www.sciencedirect.com/science/article/pii/S0010465517300681"
}