A generic tensor library for .NET

EDIT: Revised the content to align with the changes introduced in version 2 of NetFabric.Numerics.Tensors.

In a previous article, I explored SIMD and its impact on improving mathematical calculations. However, dealing with the increased complexity introduced by SIMD can be challenging, requiring code adjustments for various algorithms and potentially elevating the risk of bugs. A reusable, highly-optimized iteration of Span<T> would serve as a valuable tool, saving considerable development and testing time.

As a software engineer, my objective is often to minimize code redundancy while maintaining optimal performance.

In a separate article, I introduced the concept of “value-type delegates”, functioning similarly to C# delegates for injecting custom code into existing algorithms. The crucial distinction is that, unlike C# delegates, they do not impact performance. This approach facilitates the creation of a library defining reusable SIMD-powered methods without compromising performance.

Tensors: Foundations and Applications

Tensors represent a fundamental concept in mathematics and physics, extending the idea of vectors into a more versatile framework. Essentially, they are mathematical entities represented by arrays of components, where each component is a function of the coordinates within a given space. Tensors come in various orders, signifying the number of indices needed to specify each component. For example, a scalar is a zeroth-order tensor, a vector is a first-order tensor, and matrices are second-order tensors. The adaptability of tensors proves invaluable in fields such as physics, engineering, and machine learning, allowing us to concisely and powerfully describe and manipulate complex relationships.

The widespread adoption of this concept can be traced back to the introduction of the TensorFlow and PyTorch libraries, sparking a recent revolution in deep learning. The ability to perform mathematical calculations using SIMD, whether on the CPU or GPU, has transformed the landscape of extensive mathematical computations.

While these libraries are predominantly developed in Python, there are .NET ports available, such as TensorFlow.NET, TorchSharp, and Torch.NET.

With the release of .NET 8, an updated version of the System.Numerics.Tensors library emerges. Leveraging the latest .NET capabilities, it provides direct support for low-level SIMD operations within the .NET environment. This library, smaller in size compared to its counterparts, is designed exclusively for CPU execution and does not offer tensor transformation capabilities. This characteristic makes it an attractive option for less complex scenarios, especially those not requiring the use of expensive NVIDIA GPUs.

NetFabric.Numerics.Tensors

In my ongoing work on my geometry library, my aim is to maximize vectorization, especially when dealing with collections of geometry objects. This library employs an object-oriented approach where each geometry object encapsulates its coordinate values. This differs from conventional tensor usage, where each coordinate is typically represented by a separate tensor. Moreover, the library embraces generic mathematics, utilizing generics to handle various element types. It’s important to note that the current version of System.Numerics.Tensors is confined to exclusively supporting the Single type (float in C# or float32 in F#).

Considering these factors, I’ve made the decision to develop my own open-source library named NetFabric.Numerics.Tensors. While drawing inspiration from System.Numerics.Tensors, it incorporates some distinctions:

  • NetFabric.Numerics.Tensors provides support for all value types implementing the generic math interfaces found in the System.Numerics namespace. This library builds on Vector<T>, also from System.Numerics, leveraging SIMD to enhance performance across all supported types. In contrast, as mentioned earlier, System.Numerics.Tensors is currently limited to supporting only Single.

  • System.Numerics.Tensors employs SIMD for tensor operations regardless of the number of elements. Conversely, NetFabric.Numerics.Tensors leverages SIMD only when the number of elements can fully occupy at least one Vector<T> for the specific system it’s running on, processing any remaining elements iteratively.

  • While System.Numerics.Tensors enjoys support across various .NET versions, including the .NET Framework, NetFabric.Numerics.Tensors is exclusively compatible with .NET 8.

  • NetFabric.Numerics.Tensors utilizes Vector<T> to leverage intrinsics (SIMD) for enhanced hardware acceleration. In contrast, System.Numerics.Tensors relies on Vector128, Vector256, and Vector512 for more detailed control over intrinsics. The latter provides specific control but lacks the generality and abstraction offered by Vector<T>. While Vector<T> abstracts SIMD complexity with generics support, it doesn’t allow explicit control over the vector size, consistently opting for the largest available size.

  • NetFabric.Numerics.Tensors accommodates tuples of values of the same type, streamlining operations on 2D, 3D, and 4D vectors without the need to duplicate coordinates into separate tensors.

  • NetFabric.Numerics.Tensors empowers third-party developers to create custom operations, a capability not currently present in System.Numerics.Tensors.
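
The mixed vectorized-plus-scalar strategy described in the points above can be sketched as follows. This is a minimal, self-contained illustration of the technique, not the library's actual implementation:

```csharp
using System;
using System.Numerics;

static class VectorizedAdd
{
    // Adds two spans element-wise, using Vector<T> for as many full
    // vectors as fit and plain scalar code for the remaining elements.
    public static void Add(ReadOnlySpan<float> x, ReadOnlySpan<float> y, Span<float> destination)
    {
        if (x.Length != y.Length || destination.Length < x.Length)
            throw new ArgumentException("Spans must have compatible lengths.");

        var index = 0;

        // Vectorize only when at least one full Vector<float> fits.
        if (Vector.IsHardwareAccelerated && x.Length >= Vector<float>.Count)
        {
            var vectorizableCount = x.Length - (x.Length % Vector<float>.Count);
            for (; index < vectorizableCount; index += Vector<float>.Count)
            {
                var vx = new Vector<float>(x.Slice(index, Vector<float>.Count));
                var vy = new Vector<float>(y.Slice(index, Vector<float>.Count));
                (vx + vy).CopyTo(destination.Slice(index, Vector<float>.Count));
            }
        }

        // Process any remaining elements iteratively.
        for (; index < x.Length; index++)
            destination[index] = x[index] + y[index];
    }
}
```

The same pattern generalizes to any element type supported by Vector<T>; the library layers the operator abstraction described later on top of this kind of loop.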

If your requirements involve exclusively predefined operations on Single, I suggest utilizing System.Numerics.Tensors. However, if you need support for other types and/or custom operators, I recommend exploring NetFabric.Numerics.Tensors. Give it a try, conduct benchmarks, and assess the performance gains it may offer. Below, you can find some benchmark results for reference.

Detailed documentation is provided on its dedicated documentation site. You can also access the source code for the library on GitHub. Feel free to delve into the repository for comprehensive insights, contribute to ongoing development, or report any encountered issues. Your exploration and involvement are encouraged.

Getting Started with NetFabric.Numerics.Tensors

To utilize NetFabric.Numerics.Tensors, obtain the NetFabric.Numerics.Tensors package from NuGet. Simply include it as a dependency in your project and import it wherever necessary (using NetFabric.Numerics.Tensors; in C# or open NetFabric.Numerics.Tensors in F#).

This library offers methods designed for operations involving one, two, or three ReadOnlySpan<T>, depending on the specific operation. The results are then provided in a Span<T>, which should be supplied as the last parameter. It's essential to note that the destination Span<T> must be the same size as or larger than the sources. In-place operations are supported when the destination parameter is the same as any of the sources.

NOTE: The examples are in C#, and I hope developers using other languages can still grasp the concepts presented.

For instance, given a variable data of type Span<int>, the following code snippet replaces each element of the span with its square root:

Tensor.Sqrt(data, data);

Note that since data serves as both the source and destination, the operation is performed in-place.

For variables x, y, and result, all of type Span<float> and of the same size, the following example updates each element in result with the sum of the corresponding elements in x and y:

Tensor.Add(x, y, result);

The library also supports aggregation operations. For a variable values of type Span<float>, the following snippet calculates the sum of all its elements:

var sum = Tensor.Sum(values);

Custom Operations

While NetFabric.Numerics.Tensors provides an array of primitive operations, combining them might not be efficient, necessitating multiple iterations over the sources. Foreseeing and implementing all potential combinations can be challenging. To tackle this, NetFabric.Numerics.Tensors supports the implementation of custom operators. This allows you to define the specific operation for each element of the source, while still benefiting from the high-performance reusable iteration code.

Within the domain of custom operations, NetFabric.Numerics.Tensors facilitates two main categories of operations on spans of data:

  • Apply: This operation employs one, two, or three source spans of data to apply a specified function, storing the result in the destination span. The operation can be executed in-place if the destination is the same as one of the sources. Additionally, it accommodates a value or a tuple instead of the second or third source spans.

  • Aggregate: This operation consolidates a source span of data into either a single value or a tuple of values.

These methods accept a generic type representing the operations to be applied to the source spans. The operator types must implement one of the following operator interfaces:

public interface IOperator
{
    static virtual bool IsVectorizable => true;
}

public interface IUnaryOperator<T, TResult> : IOperator
    where T : struct
    where TResult : struct
{
    static abstract TResult Invoke(T x);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T> x);
}

public interface IBinaryOperator<T1, T2, TResult> : IOperator
    where T1 : struct
    where T2 : struct
    where TResult : struct
{
    static abstract TResult Invoke(T1 x, T2 y);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T1> x, ref readonly Vector<T2> y);
}

public interface IGenericBinaryOperator<T1, T2, TResult> : IOperator
    where T1 : struct
    where TResult : struct
{
    static abstract TResult Invoke(T1 x, T2 y);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T1> x, T2 y);
}

public interface ITernaryOperator<T1, T2, T3, TResult> : IOperator
    where T1 : struct
    where T2 : struct
    where T3 : struct
    where TResult : struct
{
    static abstract TResult Invoke(T1 x, T2 y, T3 z);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T1> x, ref readonly Vector<T2> y, ref readonly Vector<T3> z);
}

public interface IAggregationOperator<T, TResult> : IBinaryOperator<TResult, T, TResult>
    where T : struct
    where TResult : struct
{
    static virtual TResult Identity => Throw.NotSupportedException<TResult>();
    static abstract TResult Invoke(TResult x, TResult y);
    static abstract TResult Invoke(TResult value, ref readonly Vector<TResult> vector);
}

Take note that the interfaces utilize static virtual members, a feature introduced in .NET 7. Unlike the value delegates employed in my previous post, there’s no need to create an instance of the operator to utilize the methods. This also means that operators are pure and cannot have internal state.

Take, for instance, the square operator, responsible for computing the square of values:

public readonly struct SquareOperator<T> : IUnaryOperator<T, T>
    where T : struct, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x)
        => x * x;

    public static Vector<T> Invoke(ref readonly Vector<T> x)
        => x * x;
}

This is a unary operator, designed to operate on a single source. The generic type T is limited to struct and must implement IMultiplyOperators<T, T, T>, indicating that only value types with the * operator implemented can be used. The Invoke methods simply carry out the square operation for either a single T value or a Vector<T> of values.

Similarly, consider an addition operator, which computes the sum of values:

readonly struct AddOperator<T> : IBinaryOperator<T, T, T>
    where T : struct, IAdditionOperators<T, T, T>
{
    public static T Invoke(T x, T y)
        => x + y;

    public static Vector<T> Invoke(ref readonly Vector<T> x, ref readonly Vector<T> y)
        => x + y;
}

This is a binary operator, working on two sources, the addends. The generic type T is constrained to struct and must implement IAdditionOperators<T, T, T>, indicating that only value types with the + operator implemented can be used. The Invoke methods simply perform the addition operation for either a single T value or a Vector<T> of values.

Furthermore, an operator calculating the addition followed by multiplication of values is implemented as follows:

readonly struct AddMultiplyOperator<T> : ITernaryOperator<T, T, T, T>
    where T : struct, IAdditionOperators<T, T, T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x, T y, T z)
        => (x + y) * z;

    public static Vector<T> Invoke(ref readonly Vector<T> x, ref readonly Vector<T> y, ref readonly Vector<T> z)
        => (x + y) * z;
}

This is a ternary operator, handling three sources, the addends plus the multiplier. The generic type T is constrained to struct, IAdditionOperators<T, T, T>, and IMultiplyOperators<T, T, T>, indicating that only value types with the + and * operators implemented can be used. The Invoke methods simply perform the addition operation followed by multiplication for either a single T value or a Vector<T> of values.

Finally, an operator determining the sum of all elements of the source is implemented as follows:

readonly struct SumOperator<T> : IAggregationOperator<T, T>
    where T : struct, IAdditiveIdentity<T, T>, IAdditionOperators<T, T, T>
{
    public static T Identity
        => T.AdditiveIdentity;

    public static T Invoke(T x, ref readonly Vector<T> y)
        => x + Vector.Sum(y);

    public static T Invoke(T x, T y)
        => x + y;

    public static Vector<T> Invoke(ref readonly Vector<T> x, ref readonly Vector<T> y)
        => x + y;
}

This is an aggregation operator, delivering a value. The generic type T is constrained to struct, IAdditiveIdentity<T, T>, and IAdditionOperators<T, T, T>, indicating that only value types with the additive identity and the + operator implemented can be used. The Identity initializes the sum using the additive identity. The Invoke methods perform the addition of T and Vector<T> values.

It’s worth noting that certain operators may only be partially supported or not supported by Vector<T>. Take the shift left operation (<<), for instance. While Vector<T> only supports it for signed and unsigned integer primitives, any type implementing IShiftOperators<TSelf, TOther, TResult> can support left shift, including third-party-developed types. To accommodate all scenarios, a generic operator that isn’t vectorizable can be implemented, along with specific operators for each vectorizable type. The example below illustrates this approach, focusing on the sbyte type:

readonly struct ShiftLeftOperator<T, TResult> : IGenericBinaryOperator<T, int, TResult>
    where T : struct, IShiftOperators<T, int, TResult>
    where TResult : struct
{
    public static bool IsVectorizable
        => false;

    public static TResult Invoke(T value, int count)
        => value << count;

    public static Vector<TResult> Invoke(ref readonly Vector<T> value, int count)
        => Throw.NotSupportedException<Vector<TResult>>();
}

readonly struct ShiftLeftSByteOperator : IGenericBinaryOperator<sbyte, int, sbyte>
{
    public static sbyte Invoke(sbyte value, int count)
        => (sbyte)(value << count);

    public static Vector<sbyte> Invoke(ref readonly Vector<sbyte> value, int count)
        => Vector.ShiftLeft(value, count);
}

Please be aware that IsVectorizable is included in every operator and returns true by default. Overriding it is only necessary when the intention is to return false.

Note that the operator implements IGenericBinaryOperator<T1, T2, TResult>. This denotes a binary operator where the vectorization method accepts a value rather than a vector as the second parameter.

Using the operators

To employ the operators, you simply need to utilize either the Apply or Aggregate methods and provide the generic parameters along with the necessary method parameters. For instance, for the Add operation, the following overloads are provided:

public static void Add<T>(ReadOnlySpan<T> x, T y, Span<T> destination)
    where T : struct, IAdditionOperators<T, T, T>
    => Apply<T, AddOperator<T>>(x, y, destination);

public static void Add<T>(ReadOnlySpan<T> x, ValueTuple<T, T> y, Span<T> destination)
    where T : struct, IAdditionOperators<T, T, T>
    => Apply<T, AddOperator<T>>(x, y, destination);

public static void Add<T>(ReadOnlySpan<T> x, ValueTuple<T, T, T> y, Span<T> destination)
    where T : struct, IAdditionOperators<T, T, T>
    => Apply<T, AddOperator<T>>(x, y, destination);

public static void Add<T>(ReadOnlySpan<T> x, ReadOnlySpan<T> y, Span<T> destination)
    where T : struct, IAdditionOperators<T, T, T>
    => Apply<T, AddOperator<T>>(x, y, destination);

It’s noteworthy that these not only support the addition of values in two spans but also the addition of either a constant or a tuple of constants to the values in a span.
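
Adding a tuple of constants to a span of interleaved 2D values requires broadcasting the pair into a vector as a repeating pattern. A minimal sketch of the idea (assuming, as on current hardware, that Vector<float>.Count is even); the library's internals may differ:

```csharp
using System;
using System.Numerics;

static class TupleBroadcast
{
    // Fills a Vector<float> with a repeating (x, y) pattern so that a
    // tuple of constants can be added to interleaved 2D data in a
    // single SIMD operation.
    public static Vector<float> Repeat(float x, float y)
    {
        // Vector<float>.Count is a power of two (at least 4), so
        // stepping by 2 always fills the buffer completely.
        Span<float> buffer = stackalloc float[Vector<float>.Count];
        for (var index = 0; index < buffer.Length; index += 2)
        {
            buffer[index] = x;
            buffer[index + 1] = y;
        }
        return new Vector<float>(buffer);
    }
}
```

Adding this broadcast vector to each vector loaded from the source then adds (x, y) to every interleaved 2D element at once.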

For the Sum operation, the following is provided:

public static T Sum<T>(ReadOnlySpan<T> source)
    where T : struct, IAdditionOperators<T, T, T>, IAdditiveIdentity<T, T>
    => Aggregate<T, SumOperator<T>>(source);

public static ValueTuple<T, T> Sum2D<T>(ReadOnlySpan<T> source)
    where T : struct, IAdditionOperators<T, T, T>, IAdditiveIdentity<T, T>
    => Aggregate2D<T, SumOperator<T>>(source);

public static ValueTuple<T, T, T> Sum3D<T>(ReadOnlySpan<T> source)
    where T : struct, IAdditionOperators<T, T, T>, IAdditiveIdentity<T, T>
    => Aggregate3D<T, SumOperator<T>>(source);

public static ValueTuple<T, T, T, T> Sum4D<T>(ReadOnlySpan<T> source)
    where T : struct, IAdditionOperators<T, T, T>, IAdditiveIdentity<T, T>
    => Aggregate4D<T, SumOperator<T>>(source);

For ShiftLeft, which supplies implementations for specific types, the following methods are provided; the compiler automatically chooses the suitable overload based on the type in use:

public static void ShiftLeft<T>(ReadOnlySpan<T> value, int count, Span<T> destination)
    where T : struct, IShiftOperators<T, int, T>
    => ShiftLeft<T, T>(value, count, destination);

public static void ShiftLeft<T, TResult>(ReadOnlySpan<T> value, int count, Span<TResult> destination)
    where T : struct, IShiftOperators<T, int, TResult>
    where TResult : struct
    => ApplyGeneric<T, int, TResult, ShiftLeftOperator<T, TResult>>(value, count, destination);

public static void ShiftLeft(ReadOnlySpan<sbyte> value, int count, Span<sbyte> destination)
    => ApplyGeneric<sbyte, int, sbyte, ShiftLeftSByteOperator>(value, count, destination);

For conciseness, only the overload for sbyte type is shown here.

NetFabric.Numerics.Tensors provides these overloads for the primitive operators, but you can easily implement your own operator and use it in a similar manner.
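
As an illustration, a hypothetical CubeOperator could be plugged into the same machinery. The sketch below is self-contained: it re-declares a simplified unary operator interface (using in instead of ref readonly, for brevity) and a simplified Apply, so it demonstrates the pattern rather than the library's actual API:

```csharp
using System;
using System.Numerics;

// Simplified re-declaration of the unary operator contract, included
// here only so this sketch compiles on its own.
interface IUnaryOperator<T, TResult>
    where T : struct
    where TResult : struct
{
    static abstract TResult Invoke(T x);
    static abstract Vector<TResult> Invoke(in Vector<T> x);
}

// A hypothetical custom operator: cubes each element.
readonly struct CubeOperator : IUnaryOperator<float, float>
{
    public static float Invoke(float x) => x * x * x;
    public static Vector<float> Invoke(in Vector<float> x) => x * x * x;
}

static class MiniTensor
{
    // Simplified Apply: vectorized loop plus scalar remainder.
    public static void Apply<TOperator>(ReadOnlySpan<float> source, Span<float> destination)
        where TOperator : struct, IUnaryOperator<float, float>
    {
        var index = 0;
        for (; index + Vector<float>.Count <= source.Length; index += Vector<float>.Count)
        {
            var vector = new Vector<float>(source.Slice(index, Vector<float>.Count));
            TOperator.Invoke(in vector).CopyTo(destination.Slice(index, Vector<float>.Count));
        }
        for (; index < source.Length; index++)
            destination[index] = TOperator.Invoke(source[index]);
    }
}
```

Because the operator is a generic type parameter with static abstract members, the JIT can inline its Invoke methods, so the custom operation pays no delegate-call overhead.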

Leveraging Tensors with Lists

The CollectionsMarshal.AsSpan() method provides access to the internal array of a List<T>, returning a Span<T> that seamlessly integrates with tensor operations, functioning as both a source and a destination. However, when used as a destination, it’s crucial to ensure that all items already exist, making it suitable for in-place operations.

var span = CollectionsMarshal.AsSpan(list);
Tensor.Square(span, span);

This code squares all items in the list in-place.

Working with tensors for structured data

The tensors in NetFabric.Numerics.Tensors can handle any value-type that meets the minimum requirements. For example, the following 2D vector implementation can be used in Sum because it implements both IAdditionOperators<T, T, T> and IAdditiveIdentity<T, T>:

public readonly record struct MyVector2<T>(T X, T Y)
    : IAdditiveIdentity<MyVector2<T>, MyVector2<T>>
    , IAdditionOperators<MyVector2<T>, MyVector2<T>, MyVector2<T>>
    where T : struct, INumber<T>
{
    public MyVector2(ValueTuple<T, T> tuple)
        : this(tuple.Item1, tuple.Item2)
    { }

    public static MyVector2<T> AdditiveIdentity
        => new(T.AdditiveIdentity, T.AdditiveIdentity);

    public static MyVector2<T> operator +(MyVector2<T> left, MyVector2<T> right)
        => new (left.X + right.X, left.Y + right.Y);
}

However, Vector<T> does not support this type directly, so the tensor cannot use SIMD to optimize the Sum.

Please note that the MyVector2<T> type comprises two fields of the same type and is always a value type, implying that they are stored adjacently in memory. Consequently, a Span<MyVector2<T>> can be effortlessly converted to a Span<T> by utilizing MemoryMarshal.Cast<MyVector2<T>, T>().

This capability enables the implementation of the following:

public static MyVector2<T> Sum<T>(this ReadOnlySpan<MyVector2<T>> source)
    where T : struct, INumber<T>, IMinMaxValue<T>
    => new(Tensor.Sum2D(MemoryMarshal.Cast<MyVector2<T>, T>(source)));

This implementation allows the tensor to leverage SIMD for enhanced performance in the Sum operation on a span of MyVector2<T>, interpreting it as a span of its internal values. It’s noteworthy that the use of Sum2D is essential, reflecting the intention to calculate the sum of every other item in the span.
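
Conceptually, Sum2D keeps two running sums, one per interleaved coordinate. A scalar sketch of this aggregation (the library's version is vectorized):

```csharp
using System;

static class Sum2DExample
{
    // Scalar sketch of a 2D aggregation over interleaved data:
    // items at even indices accumulate into the first sum,
    // items at odd indices into the second.
    public static (float, float) Sum2D(ReadOnlySpan<float> source)
    {
        var sumX = 0f;
        var sumY = 0f;
        for (var index = 0; index + 1 < source.Length; index += 2)
        {
            sumX += source[index];
            sumY += source[index + 1];
        }
        return (sumX, sumY);
    }
}
```

Applied to a span of MyVector2<T> reinterpreted as a span of T, the first sum accumulates the X coordinates and the second the Y coordinates.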

Regarding Apply, the operation is applied to each element while maintaining the order in the destination span. In this context, applying MemoryMarshal.Cast<MyVector2<T>, T>() to both the sources and destination is adequate:

public static void Add<T>(ReadOnlySpan<MyVector2<T>> angles, MyVector2<T> value, Span<MyVector2<T>> result)
    where T : struct, INumber<T>, IMinMaxValue<T>
    => Tensor.Add(MemoryMarshal.Cast<MyVector2<T>, T>(angles), (value.X, value.Y), MemoryMarshal.Cast<MyVector2<T>, T>(result));

public static void Add<T>(ReadOnlySpan<MyVector2<T>> left, ReadOnlySpan<MyVector2<T>> right, Span<MyVector2<T>> result)
    where T : struct, INumber<T>, IMinMaxValue<T>
    => Tensor.Add(MemoryMarshal.Cast<MyVector2<T>, T>(left), MemoryMarshal.Cast<MyVector2<T>, T>(right), MemoryMarshal.Cast<MyVector2<T>, T>(result));

Benchmarks

I conducted benchmarks for various operations on the following system:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3007/23H2/2023Update/SunValley3)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.101
  [Host]    : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Scalar    : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT
  Vector128 : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX
  Vector256 : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
  Vector512 : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

The following benchmarks are performed:

  • Baseline_* - using a simple iteration without explicit optimizations.
  • LINQ_* - using LINQ (when available).
  • System_* - using System.Numerics.Tensors (only for Single, float in C#).
  • NetFabric_* - using NetFabric.Numerics.Tensors.

Every benchmark encompassed four distinct jobs:

  • Scalar - without any SIMD support
  • Vector128 - utilizing 128-bit SIMD support
  • Vector256 - utilizing 256-bit SIMD support
  • Vector512 - utilizing 512-bit SIMD support

The full benchmarking source code can be found here.

Add

Benchmarks performing addition on two spans (tensors), each containing 1,000 items.

The following serves as the baseline against which performance is evaluated:

public static void Add<T>(ReadOnlySpan<T> source, ReadOnlySpan<T> other, Span<T> result)
    where T : struct, IAdditionOperators<T, T, T>
{
    if (source.Length != other.Length)
        Throw.ArgumentException(nameof(source), "source and other spans must have the same length.");
    if (source.Length > result.Length)
        Throw.ArgumentException(nameof(result), "result span is too small.");

    for(var index = 0; index < source.Length; index++)
        result[index] = source[index] + other[index];
}
| Method | Job | Categories | Count | Mean | StdDev | Ratio |
|---|---|---|---|---|---|---|
| Baseline_Double | Scalar | Double | 1000 | 311.04 ns | 2.651 ns | baseline |
| NetFabric_Double | Scalar | Double | 1000 | 209.02 ns | 1.433 ns | 1.49x faster |
| Baseline_Double | Vector128 | Double | 1000 | 310.25 ns | 2.740 ns | 1.00x faster |
| NetFabric_Double | Vector128 | Double | 1000 | 134.08 ns | 2.219 ns | 2.32x faster |
| Baseline_Double | Vector256 | Double | 1000 | 310.13 ns | 3.700 ns | 1.00x faster |
| NetFabric_Double | Vector256 | Double | 1000 | 93.20 ns | 0.675 ns | 3.34x faster |
| Baseline_Double | Vector512 | Double | 1000 | 310.08 ns | 2.661 ns | 1.00x faster |
| NetFabric_Double | Vector512 | Double | 1000 | 87.48 ns | 0.741 ns | 3.56x faster |
| Baseline_Float | Scalar | Float | 1000 | 313.76 ns | 3.896 ns | baseline |
| System_Float | Scalar | Float | 1000 | 209.50 ns | 1.307 ns | 1.50x faster |
| NetFabric_Float | Scalar | Float | 1000 | 205.32 ns | 1.435 ns | 1.53x faster |
| Baseline_Float | Vector128 | Float | 1000 | 321.60 ns | 5.085 ns | 1.02x slower |
| System_Float | Vector128 | Float | 1000 | 70.98 ns | 0.752 ns | 4.42x faster |
| NetFabric_Float | Vector128 | Float | 1000 | 71.42 ns | 0.516 ns | 4.39x faster |
| Baseline_Float | Vector256 | Float | 1000 | 314.04 ns | 2.951 ns | 1.00x slower |
| System_Float | Vector256 | Float | 1000 | 46.24 ns | 0.222 ns | 6.79x faster |
| NetFabric_Float | Vector256 | Float | 1000 | 46.09 ns | 0.233 ns | 6.81x faster |
| Baseline_Float | Vector512 | Float | 1000 | 314.20 ns | 2.108 ns | 1.00x slower |
| System_Float | Vector512 | Float | 1000 | 50.93 ns | 0.305 ns | 6.17x faster |
| NetFabric_Float | Vector512 | Float | 1000 | 45.42 ns | 0.200 ns | 6.92x faster |
| Baseline_Half | Scalar | Half | 1000 | 9,067.22 ns | 105.207 ns | baseline |
| NetFabric_Half | Scalar | Half | 1000 | 9,022.14 ns | 98.395 ns | 1.01x faster |
| Baseline_Half | Vector128 | Half | 1000 | 8,308.19 ns | 63.387 ns | 1.09x faster |
| NetFabric_Half | Vector128 | Half | 1000 | 8,162.03 ns | 106.299 ns | 1.11x faster |
| Baseline_Half | Vector256 | Half | 1000 | 8,280.55 ns | 71.927 ns | 1.10x faster |
| NetFabric_Half | Vector256 | Half | 1000 | 8,174.82 ns | 77.175 ns | 1.11x faster |
| Baseline_Half | Vector512 | Half | 1000 | 8,313.73 ns | 75.231 ns | 1.09x faster |
| NetFabric_Half | Vector512 | Half | 1000 | 8,164.85 ns | 81.093 ns | 1.11x faster |
| Baseline_Int | Scalar | Int | 1000 | 354.66 ns | 3.378 ns | baseline |
| NetFabric_Int | Scalar | Int | 1000 | 214.07 ns | 1.596 ns | 1.66x faster |
| Baseline_Int | Vector128 | Int | 1000 | 352.79 ns | 3.747 ns | 1.01x faster |
| NetFabric_Int | Vector128 | Int | 1000 | 82.25 ns | 0.631 ns | 4.31x faster |
| Baseline_Int | Vector256 | Int | 1000 | 352.52 ns | 2.231 ns | 1.01x faster |
| NetFabric_Int | Vector256 | Int | 1000 | 52.30 ns | 0.315 ns | 6.78x faster |
| Baseline_Int | Vector512 | Int | 1000 | 351.95 ns | 2.057 ns | 1.01x faster |
| NetFabric_Int | Vector512 | Int | 1000 | 52.60 ns | 0.378 ns | 6.74x faster |
| Baseline_Long | Scalar | Long | 1000 | 356.10 ns | 3.108 ns | baseline |
| NetFabric_Long | Scalar | Long | 1000 | 213.75 ns | 1.372 ns | 1.67x faster |
| Baseline_Long | Vector128 | Long | 1000 | 354.82 ns | 2.280 ns | 1.00x faster |
| NetFabric_Long | Vector128 | Long | 1000 | 147.92 ns | 1.259 ns | 2.41x faster |
| Baseline_Long | Vector256 | Long | 1000 | 354.91 ns | 3.753 ns | 1.00x faster |
| NetFabric_Long | Vector256 | Long | 1000 | 91.50 ns | 0.747 ns | 3.89x faster |
| Baseline_Long | Vector512 | Long | 1000 | 354.09 ns | 2.805 ns | 1.01x faster |
| NetFabric_Long | Vector512 | Long | 1000 | 90.98 ns | 0.637 ns | 3.91x faster |
| Baseline_Short | Scalar | Short | 1000 | 425.04 ns | 3.359 ns | baseline |
| NetFabric_Short | Scalar | Short | 1000 | 307.20 ns | 2.011 ns | 1.38x faster |
| Baseline_Short | Vector128 | Short | 1000 | 421.70 ns | 2.703 ns | 1.01x faster |
| NetFabric_Short | Vector128 | Short | 1000 | 39.23 ns | 0.228 ns | 10.84x faster |
| Baseline_Short | Vector256 | Short | 1000 | 424.45 ns | 2.715 ns | 1.00x faster |
| NetFabric_Short | Vector256 | Short | 1000 | 29.63 ns | 0.416 ns | 14.39x faster |
| Baseline_Short | Vector512 | Short | 1000 | 422.11 ns | 2.148 ns | 1.01x faster |
| NetFabric_Short | Vector512 | Short | 1000 | 29.40 ns | 0.289 ns | 14.45x faster |

Sum

Benchmarks performing the sum of the items in a span (tensor), containing 1,000 items.

The following serves as the baseline against which performance is evaluated:

public static T Sum<T>(ReadOnlySpan<T> source)
    where T : struct, IAdditiveIdentity<T, T>, IAdditionOperators<T, T, T>
{
    var sum = T.AdditiveIdentity;
    foreach (var item in source)
        sum += item;
    return sum;
}

It additionally compares with the performance of LINQ’s Sum(). However, it’s worth noting that this method lacks support for the types short and Half. In such instances, LINQ’s Aggregate() is employed instead.
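
For short, for example, the fallback presumably looks like the following (the exact benchmark code is in the repository linked above):

```csharp
using System;
using System.Linq;

static class AggregateFallback
{
    // LINQ Aggregate used as a substitute for Sum() on types it does
    // not support, such as short; the running sum is re-cast because
    // short addition widens to int.
    public static short Sum(short[] source)
        => source.Aggregate((short)0, (sum, item) => (short)(sum + item));
}
```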

| Method | Job | Categories | Count | Mean | StdDev | Median | Ratio |
|---|---|---|---|---|---|---|---|
| Baseline_Double | Scalar | Double | 1000 | 570.98 ns | 5.629 ns | 573.36 ns | baseline |
| LINQ_Double | Scalar | Double | 1000 | 571.74 ns | 5.388 ns | 573.79 ns | 1.00x slower |
| NetFabric_Double | Scalar | Double | 1000 | 147.88 ns | 2.157 ns | 148.22 ns | 3.86x faster |
| Baseline_Double | Vector128 | Double | 1000 | 572.54 ns | 5.996 ns | 573.91 ns | 1.00x slower |
| LINQ_Double | Vector128 | Double | 1000 | 570.89 ns | 5.926 ns | 571.20 ns | 1.00x faster |
| NetFabric_Double | Vector128 | Double | 1000 | 277.94 ns | 2.981 ns | 278.61 ns | 2.05x faster |
| Baseline_Double | Vector256 | Double | 1000 | 572.45 ns | 5.287 ns | 573.21 ns | 1.00x slower |
| LINQ_Double | Vector256 | Double | 1000 | 573.01 ns | 6.732 ns | 574.56 ns | 1.00x slower |
| NetFabric_Double | Vector256 | Double | 1000 | 128.70 ns | 1.034 ns | 128.53 ns | 4.44x faster |
| Baseline_Double | Vector512 | Double | 1000 | 571.11 ns | 6.915 ns | 572.84 ns | 1.00x slower |
| LINQ_Double | Vector512 | Double | 1000 | 570.82 ns | 4.606 ns | 572.51 ns | 1.00x faster |
| NetFabric_Double | Vector512 | Double | 1000 | 128.17 ns | 1.055 ns | 128.33 ns | 4.46x faster |
| Baseline_Float | Scalar | Float | 1000 | 570.56 ns | 4.612 ns | 572.98 ns | baseline |
| LINQ_Float | Scalar | Float | 1000 | 1,014.42 ns | 10.750 ns | 1,018.47 ns | 1.78x slower |
| System_Float | Scalar | Float | 1000 | 574.55 ns | 5.687 ns | 575.36 ns | 1.01x slower |
| NetFabric_Float | Scalar | Float | 1000 | 145.22 ns | 1.107 ns | 145.80 ns | 3.93x faster |
| Baseline_Float | Vector128 | Float | 1000 | 569.56 ns | 4.442 ns | 572.11 ns | 1.00x faster |
| LINQ_Float | Vector128 | Float | 1000 | 1,207.31 ns | 8.692 ns | 1,209.80 ns | 2.12x slower |
| System_Float | Vector128 | Float | 1000 | 115.47 ns | 0.658 ns | 115.61 ns | 4.94x faster |
| NetFabric_Float | Vector128 | Float | 1000 | 127.75 ns | 1.536 ns | 126.71 ns | 4.46x faster |
| Baseline_Float | Vector256 | Float | 1000 | 568.82 ns | 5.199 ns | 567.58 ns | 1.00x faster |
| LINQ_Float | Vector256 | Float | 1000 | 1,210.59 ns | 10.939 ns | 1,213.35 ns | 2.12x slower |
| System_Float | Vector256 | Float | 1000 | 43.05 ns | 0.335 ns | 43.16 ns | 13.25x faster |
| NetFabric_Float | Vector256 | Float | 1000 | 52.75 ns | 0.238 ns | 52.77 ns | 10.82x faster |
| Baseline_Float | Vector512 | Float | 1000 | 570.12 ns | 5.911 ns | 571.47 ns | 1.00x slower |
| LINQ_Float | Vector512 | Float | 1000 | 1,401.84 ns | 10.311 ns | 1,405.26 ns | 2.46x slower |
| System_Float | Vector512 | Float | 1000 | 20.96 ns | 0.204 ns | 20.99 ns | 27.23x faster |
| NetFabric_Float | Vector512 | Float | 1000 | 52.38 ns | 0.445 ns | 52.32 ns | 10.89x faster |
| Baseline_Half | Scalar | Half | 1000 | 12,303.82 ns | 51.396 ns | 12,320.83 ns | baseline |
| LINQ_Half | Scalar | Half | 1000 | 12,569.00 ns | 42.232 ns | 12,580.73 ns | 1.02x slower |
| NetFabric_Half | Scalar | Half | 1000 | 9,224.23 ns | 81.418 ns | 9,274.72 ns | 1.33x faster |
| Baseline_Half | Vector128 | Half | 1000 | 11,958.97 ns | 44.551 ns | 11,980.43 ns | 1.03x faster |
| LINQ_Half | Vector128 | Half | 1000 | 12,195.96 ns | 74.433 ns | 12,202.99 ns | 1.01x faster |
| NetFabric_Half | Vector128 | Half | 1000 | 8,146.77 ns | 81.343 ns | 8,164.09 ns | 1.51x faster |
| Baseline_Half | Vector256 | Half | 1000 | 11,973.93 ns | 108.398 ns | 11,984.58 ns | 1.03x faster |
| LINQ_Half | Vector256 | Half | 1000 | 12,158.34 ns | 126.659 ns | 12,116.17 ns | 1.01x faster |
| NetFabric_Half | Vector256 | Half | 1000 | 8,136.64 ns | 71.782 ns | 8,164.75 ns | 1.51x faster |
| Baseline_Half | Vector512 | Half | 1000 | 11,966.10 ns | 63.814 ns | 11,992.15 ns | 1.03x faster |
| LINQ_Half | Vector512 | Half | 1000 | 12,183.80 ns | 77.386 ns | 12,207.81 ns | 1.01x faster |
| NetFabric_Half | Vector512 | Half | 1000 | 8,132.30 ns | 73.452 ns | 8,140.99 ns | 1.51x faster |
| Baseline_Int | Scalar | Int | 1000 | 209.13 ns | 1.815 ns | 209.57 ns | baseline |
| LINQ_Int | Scalar | Int | 1000 | 207.11 ns | 1.197 ns | 207.26 ns | 1.01x faster |
| NetFabric_Int | Scalar | Int | 1000 | 106.23 ns | 0.707 ns | 106.22 ns | 1.97x faster |
| Baseline_Int | Vector128 | Int | 1000 | 221.71 ns | 1.209 ns | 221.74 ns | 1.06x slower |
| LINQ_Int | Vector128 | Int | 1000 | 106.27 ns | 3.544 ns | 105.34 ns | 1.96x faster |
| NetFabric_Int | Vector128 | Int | 1000 | 59.76 ns | 0.899 ns | 60.06 ns | 3.50x faster |
| Baseline_Int | Vector256 | Int | 1000 | 221.06 ns | 1.133 ns | 220.85 ns | 1.06x slower |
| LINQ_Int | Vector256 | Int | 1000 | 50.87 ns | 0.211 ns | 50.88 ns | 4.11x faster |
| NetFabric_Int | Vector256 | Int | 1000 | 33.41 ns | 0.293 ns | 33.41 ns | 6.26x faster |
| Baseline_Int | Vector512 | Int | 1000 | 219.06 ns | 1.548 ns | 218.67 ns | 1.05x slower |
| LINQ_Int | Vector512 | Int | 1000 | 50.72 ns | 0.320 ns | 50.70 ns | 4.12x faster |
| NetFabric_Int | Vector512 | Int | 1000 | 33.69 ns | 0.330 ns | 33.72 ns | 6.21x faster |
| Baseline_Long | Scalar | Long | 1000 | 209.33 ns | 3.041 ns | 208.05 ns | baseline |
| LINQ_Long | Scalar | Long | 1000 | 208.01 ns | 1.683 ns | 208.05 ns | 1.01x faster |
| NetFabric_Long | Scalar | Long | 1000 | 106.83 ns | 0.630 ns | 106.97 ns | 1.96x faster |
| Baseline_Long | Vector128 | Long | 1000 | 220.86 ns | 1.267 ns | 220.79 ns | 1.06x slower |
| LINQ_Long | Vector128 | Long | 1000 | 205.47 ns | 1.143 ns | 205.74 ns | 1.02x faster |
| NetFabric_Long | Vector128 | Long | 1000 | 110.76 ns | 0.171 ns | 110.71 ns | 1.89x faster |
| Baseline_Long | Vector256 | Long | 1000 | 220.38 ns | 1.423 ns | 220.08 ns | 1.05x slower |
| LINQ_Long | Vector256 | Long | 1000 | 108.83 ns | 2.826 ns | 108.70 ns | 1.93x faster |
| NetFabric_Long | Vector256 | Long | 1000 | 59.73 ns | 0.512 ns | 59.62 ns | 3.51x faster |
| Baseline_Long | Vector512 | Long | 1000 | 220.13 ns | 2.014 ns | 220.33 ns | 1.05x slower |
| LINQ_Long | Vector512 | Long | 1000 | 109.13 ns | 4.001 ns | 109.02 ns | 1.91x faster |
| NetFabric_Long | Vector512 | Long | 1000 | 60.10 ns | 0.533 ns | 60.23 ns | 3.48x faster |
| Baseline_Short | Scalar | Short | 1000 | 398.96 ns | 4.568 ns | 397.85 ns | baseline |
| LINQ_Short | Scalar | Short | 1000 | 747.23 ns | 4.973 ns | 746.78 ns | 1.87x slower |
| NetFabric_Short | Scalar | Short | 1000 | 217.72 ns | 1.566 ns | 217.20 ns | 1.83x faster |
| Baseline_Short | Vector128 | Short | 1000 | 397.20 ns | 2.326 ns | 398.01 ns | 1.00x faster |
| LINQ_Short | Vector128 | Short | 1000 | 743.73 ns | 3.997 ns | 744.50 ns | 1.86x slower |
| NetFabric_Short | Vector128 | Short | 1000 | 33.28 ns | 0.324 ns | 33.35 ns | 12.00x faster |
| Baseline_Short | Vector256 | Short | 1000 | 398.76 ns | 3.406 ns | 398.69 ns | 1.00x slower |
| LINQ_Short | Vector256 | Short | 1000 | 745.48 ns | 6.354 ns | 745.06 ns | 1.87x slower |
| NetFabric_Short | Vector256 | Short | 1000 | 17.16 ns | 0.238 ns | 17.16 ns | 23.25x faster |
| Baseline_Short | Vector512 | Short | 1000 | 396.99 ns | 3.059 ns | 397.21 ns | 1.00x faster |
| LINQ_Short | Vector512 | Short | 1000 | 754.52 ns | 12.566 ns | 752.41 ns | 1.89x slower |
| NetFabric_Short | Vector512 | Short | 1000 | 20.55 ns | 1.059 ns | 20.91 ns | 18.86x faster |

Sum2D

These benchmarks sum the 2D vectors in a span (tensor) containing 1,000 vectors. Each vector is a value type with two fields of the same type.

They use the same baseline as the Sum benchmarks, since it relies on generic math and therefore supports any of these cases.

They also compare against the performance of LINQ’s Aggregate(), since LINQ’s Sum() does not support non-native numeric types.
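As a rough sketch of the setup (the exact type used in the benchmarks may differ, and `MyVector2<T>` is a name I chose for illustration), a 2D vector value type with generic math support, and the LINQ Aggregate() fallback, might look like this:

```csharp
using System.Linq;
using System.Numerics;

// Hypothetical 2D vector value type: two fields of the same type,
// made summable through the generic math interfaces.
public readonly record struct MyVector2<T>(T X, T Y)
    : IAdditionOperators<MyVector2<T>, MyVector2<T>, MyVector2<T>>,
      IAdditiveIdentity<MyVector2<T>, MyVector2<T>>
    where T : struct, INumber<T>
{
    public static MyVector2<T> AdditiveIdentity
        => new(T.AdditiveIdentity, T.AdditiveIdentity);

    public static MyVector2<T> operator +(MyVector2<T> left, MyVector2<T> right)
        => new(left.X + right.X, left.Y + right.Y);
}

public static class Sum2DExample
{
    public static MyVector2<int> SumWithAggregate()
    {
        var source = Enumerable.Range(0, 1_000)
            .Select(index => new MyVector2<int>(index, index))
            .ToArray();

        // LINQ's Sum() cannot be used on a non-native numeric type,
        // so the benchmark falls back to Aggregate():
        return source.Aggregate(
            MyVector2<int>.AdditiveIdentity,
            (sum, item) => sum + item);
    }
}
```

Because the type only exposes addition through the standard generic math interfaces, the same generic baseline code used for the Sum benchmarks can sum it unchanged.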

| Method | Job | Categories | Count | Mean | StdDev | Median | Ratio |
|---|---|---|---|---|---|---|---|
| Baseline_Double | Scalar | Double | 1000 | 586.1 ns | 4.78 ns | 588.5 ns | baseline |
| LINQ_Double | Scalar | Double | 1000 | 2,452.1 ns | 22.31 ns | 2,457.8 ns | 4.18x slower |
| NetFabric_Double | Scalar | Double | 1000 | 297.2 ns | 3.22 ns | 297.8 ns | 1.97x faster |
| Baseline_Double | Vector128 | Double | 1000 | 586.9 ns | 5.41 ns | 589.8 ns | 1.00x slower |
| LINQ_Double | Vector128 | Double | 1000 | 2,449.7 ns | 17.31 ns | 2,454.0 ns | 4.18x slower |
| NetFabric_Double | Vector128 | Double | 1000 | 295.5 ns | 3.26 ns | 296.2 ns | 1.98x faster |
| Baseline_Double | Vector256 | Double | 1000 | 587.1 ns | 6.33 ns | 587.5 ns | 1.00x slower |
| LINQ_Double | Vector256 | Double | 1000 | 2,442.0 ns | 24.81 ns | 2,448.3 ns | 4.17x slower |
| NetFabric_Double | Vector256 | Double | 1000 | 295.7 ns | 2.30 ns | 296.5 ns | 1.98x faster |
| Baseline_Double | Vector512 | Double | 1000 | 587.2 ns | 6.54 ns | 587.1 ns | 1.00x slower |
| LINQ_Double | Vector512 | Double | 1000 | 2,444.4 ns | 26.33 ns | 2,448.4 ns | 4.17x slower |
| NetFabric_Double | Vector512 | Double | 1000 | 294.6 ns | 2.54 ns | 294.5 ns | 1.99x faster |
| | | | | | | | |
| Baseline_Float | Scalar | Float | 1000 | 2,363.4 ns | 25.96 ns | 2,362.2 ns | baseline |
| LINQ_Float | Scalar | Float | 1000 | 7,681.9 ns | 81.63 ns | 7,698.7 ns | 3.25x slower |
| NetFabric_Float | Scalar | Float | 1000 | 301.4 ns | 2.27 ns | 302.0 ns | 7.84x faster |
| Baseline_Float | Vector128 | Float | 1000 | 2,352.6 ns | 27.39 ns | 2,349.8 ns | 1.00x faster |
| LINQ_Float | Vector128 | Float | 1000 | 7,659.4 ns | 99.62 ns | 7,606.3 ns | 3.24x slower |
| NetFabric_Float | Vector128 | Float | 1000 | 300.6 ns | 3.87 ns | 298.4 ns | 7.86x faster |
| Baseline_Float | Vector256 | Float | 1000 | 2,356.3 ns | 19.77 ns | 2,362.4 ns | 1.00x faster |
| LINQ_Float | Vector256 | Float | 1000 | 7,574.4 ns | 11.36 ns | 7,572.2 ns | 3.21x slower |
| NetFabric_Float | Vector256 | Float | 1000 | 301.4 ns | 2.48 ns | 302.0 ns | 7.84x faster |
| Baseline_Float | Vector512 | Float | 1000 | 2,350.5 ns | 17.85 ns | 2,355.6 ns | 1.01x faster |
| LINQ_Float | Vector512 | Float | 1000 | 7,595.7 ns | 10.69 ns | 7,598.3 ns | 3.21x slower |
| NetFabric_Float | Vector512 | Float | 1000 | 300.4 ns | 2.14 ns | 301.7 ns | 7.87x faster |
| | | | | | | | |
| Baseline_Half | Scalar | Half | 1000 | 18,149.6 ns | 131.56 ns | 18,209.0 ns | baseline |
| LINQ_Half | Scalar | Half | 1000 | 26,635.7 ns | 98.41 ns | 26,621.6 ns | 1.47x slower |
| NetFabric_Half | Scalar | Half | 1000 | 18,169.1 ns | 183.74 ns | 18,227.2 ns | 1.00x slower |
| Baseline_Half | Vector128 | Half | 1000 | 16,322.7 ns | 208.14 ns | 16,356.5 ns | 1.11x faster |
| LINQ_Half | Vector128 | Half | 1000 | 25,550.1 ns | 468.29 ns | 25,370.3 ns | 1.41x slower |
| NetFabric_Half | Vector128 | Half | 1000 | 16,194.8 ns | 159.24 ns | 16,128.4 ns | 1.12x faster |
| Baseline_Half | Vector256 | Half | 1000 | 16,239.9 ns | 167.92 ns | 16,181.3 ns | 1.12x faster |
| LINQ_Half | Vector256 | Half | 1000 | 25,348.3 ns | 119.29 ns | 25,384.0 ns | 1.40x slower |
| NetFabric_Half | Vector256 | Half | 1000 | 16,329.8 ns | 144.27 ns | 16,265.6 ns | 1.11x faster |
| Baseline_Half | Vector512 | Half | 1000 | 16,297.9 ns | 170.77 ns | 16,377.9 ns | 1.12x faster |
| LINQ_Half | Vector512 | Half | 1000 | 25,375.8 ns | 228.32 ns | 25,389.2 ns | 1.40x slower |
| NetFabric_Half | Vector512 | Half | 1000 | 16,224.8 ns | 178.99 ns | 16,314.0 ns | 1.12x faster |
| | | | | | | | |
| Baseline_Int | Scalar | Int | 1000 | 1,518.2 ns | 8.91 ns | 1,519.8 ns | baseline |
| LINQ_Int | Scalar | Int | 1000 | 5,532.7 ns | 55.53 ns | 5,539.8 ns | 3.65x slower |
| NetFabric_Int | Scalar | Int | 1000 | 207.2 ns | 1.65 ns | 207.2 ns | 7.33x faster |
| Baseline_Int | Vector128 | Int | 1000 | 1,515.0 ns | 9.10 ns | 1,518.0 ns | 1.00x faster |
| LINQ_Int | Vector128 | Int | 1000 | 5,414.7 ns | 97.83 ns | 5,405.6 ns | 3.56x slower |
| NetFabric_Int | Vector128 | Int | 1000 | 206.3 ns | 1.16 ns | 206.2 ns | 7.36x faster |
| Baseline_Int | Vector256 | Int | 1000 | 1,519.4 ns | 10.55 ns | 1,520.6 ns | 1.00x slower |
| LINQ_Int | Vector256 | Int | 1000 | 5,433.3 ns | 99.97 ns | 5,389.2 ns | 3.59x slower |
| NetFabric_Int | Vector256 | Int | 1000 | 206.2 ns | 1.20 ns | 206.6 ns | 7.36x faster |
| Baseline_Int | Vector512 | Int | 1000 | 1,515.9 ns | 8.19 ns | 1,516.9 ns | 1.00x faster |
| LINQ_Int | Vector512 | Int | 1000 | 5,371.7 ns | 80.45 ns | 5,356.9 ns | 3.54x slower |
| NetFabric_Int | Vector512 | Int | 1000 | 207.2 ns | 0.95 ns | 206.9 ns | 7.33x faster |
| | | | | | | | |
| Baseline_Long | Scalar | Long | 1000 | 412.6 ns | 2.43 ns | 413.4 ns | baseline |
| LINQ_Long | Scalar | Long | 1000 | 1,323.0 ns | 11.82 ns | 1,323.5 ns | 3.21x slower |
| NetFabric_Long | Scalar | Long | 1000 | 207.9 ns | 2.00 ns | 208.2 ns | 1.98x faster |
| Baseline_Long | Vector128 | Long | 1000 | 413.5 ns | 2.48 ns | 412.9 ns | 1.00x slower |
| LINQ_Long | Vector128 | Long | 1000 | 1,357.5 ns | 13.62 ns | 1,358.9 ns | 3.29x slower |
| NetFabric_Long | Vector128 | Long | 1000 | 207.5 ns | 2.44 ns | 207.2 ns | 1.99x faster |
| Baseline_Long | Vector256 | Long | 1000 | 412.1 ns | 2.43 ns | 412.7 ns | 1.00x faster |
| LINQ_Long | Vector256 | Long | 1000 | 1,326.8 ns | 9.45 ns | 1,330.3 ns | 3.22x slower |
| NetFabric_Long | Vector256 | Long | 1000 | 206.0 ns | 1.44 ns | 206.1 ns | 2.00x faster |
| Baseline_Long | Vector512 | Long | 1000 | 412.8 ns | 2.59 ns | 413.4 ns | 1.00x slower |
| LINQ_Long | Vector512 | Long | 1000 | 1,339.4 ns | 9.54 ns | 1,341.8 ns | 3.25x slower |
| NetFabric_Long | Vector512 | Long | 1000 | 205.8 ns | 0.93 ns | 205.8 ns | 2.00x faster |
| | | | | | | | |
| Baseline_Short | Scalar | Short | 1000 | 1,906.6 ns | 23.93 ns | 1,915.5 ns | baseline |
| LINQ_Short | Scalar | Short | 1000 | 8,428.6 ns | 109.49 ns | 8,407.6 ns | 4.42x slower |
| NetFabric_Short | Scalar | Short | 1000 | 430.9 ns | 2.20 ns | 430.7 ns | 4.42x faster |
| Baseline_Short | Vector128 | Short | 1000 | 1,881.3 ns | 2.34 ns | 1,881.2 ns | 1.01x faster |
| LINQ_Short | Vector128 | Short | 1000 | 8,390.7 ns | 88.63 ns | 8,428.9 ns | 4.40x slower |
| NetFabric_Short | Vector128 | Short | 1000 | 430.6 ns | 3.47 ns | 430.1 ns | 4.43x faster |
| Baseline_Short | Vector256 | Short | 1000 | 1,881.9 ns | 2.30 ns | 1,882.2 ns | 1.01x faster |
| LINQ_Short | Vector256 | Short | 1000 | 8,400.1 ns | 77.78 ns | 8,419.1 ns | 4.41x slower |
| NetFabric_Short | Vector256 | Short | 1000 | 434.4 ns | 5.98 ns | 432.7 ns | 4.39x faster |
| Baseline_Short | Vector512 | Short | 1000 | 2,057.9 ns | 95.45 ns | 2,041.6 ns | 1.07x slower |
| LINQ_Short | Vector512 | Short | 1000 | 8,948.6 ns | 518.20 ns | 8,685.5 ns | 4.62x slower |
| NetFabric_Short | Vector512 | Short | 1000 | 438.0 ns | 4.59 ns | 438.0 ns | 4.35x faster |

Conclusions

The NetFabric.Numerics.Tensors library employs optimizations beyond just Vector<T>, resulting in superior performance compared to the basic baseline, even in the absence of vectorization support.

Where SIMD vectors are supported, the performance boost scales with the capacity of a Vector&lt;T&gt;, so the gains are most prominent for smaller types. For larger types, noticeable benefits emerge when the platform supports larger vectors. In both scenarios, the code works seamlessly without any special handling.

The benchmarks demonstrate performance improvements for relatively large spans. Although presented here in a condensed format for space considerations, I have also benchmarked very small spans and observed no deterioration in performance.

Although NetFabric.Numerics.Tensors provides numerous primitive operations, it also allows custom operators to be implemented. The extensive range of primitive operations it supports shows that the few interfaces provided can effectively cover a wide array of scenarios.
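As an illustration, a custom operator might look roughly like the sketch below. The interface shape (one `Invoke` overload for single values and one for `Vector<T>`) follows the library's general design, but `SquareOperator` is a name I made up, and the exact interface and method signatures should be checked against the NetFabric.Numerics.Tensors source.

```csharp
using System.Numerics;

// Sketch only: a hypothetical custom unary operator that squares each element.
// It provides both a scalar path and a Vector<T> path, so the library can
// vectorize the bulk of the span and fall back to scalars for the remainder.
readonly struct SquareOperator<T> : IUnaryOperator<T, T>
    where T : struct, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x)
        => x * x;

    public static Vector<T> Invoke(ref readonly Vector<T> x)
        => x * x;
}

// Hypothetical usage, applying the operator element-wise to a span:
// Tensor.Apply<int, SquareOperator<int>>(source, destination);
```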

In contrast, Vector&lt;T&gt;, which the library uses for vectorization, lacks some low-level operations. As a result, the scope of vectorized operations is constrained by its existing API. Contributions that introduce new operations, or vectorize existing ones, are highly encouraged.

For enhanced tensor features and improved performance with GPU utilization, consider exploring alternative libraries such as TorchSharp.

I view System.Numerics.Tensors as an example of achieving “clean code” without sacrificing significant performance. It’s crucial to grasp how compilers handle your code and understand how the system executes it. I also recommend exploring my related post, “A 12% improvement, easily obtained, is never considered marginal – Donald Knuth.”

This post is licensed under CC BY 4.0 by the author.