
Unleashing Parallelism: A Guide to Efficient Span Processing with SIMD and Threading in C#

Modern CPUs offer three forms of parallel processing: instruction-level parallelism within a single core, SIMD (single instruction, multiple data), and multiple cores. This article outlines several steps for leveraging all three to process large data sets as efficiently as possible.

Base Implementation

To illustrate, let’s implement a method that adds elements from one span to corresponding elements of another span and stores the result in a destination span. This operation is commonly encountered when working with tensors, frequently used in deep learning.

static void For<T>(ReadOnlySpan<T> left, ReadOnlySpan<T> right, Span<T> destination)
    where T : struct, IAdditionOperators<T, T, T>
{
    for (var index = 0; index < left.Length; index++)
        destination[index] = left[index] + right[index];
}

This method employs a for loop to iterate over the spans and perform element-wise addition using generic math, allowing it to work with any scalar type.

For a deeper understanding of generic math, refer to my other article “Generic math in .NET”.

The method is named For so that its performance can be compared in the benchmarks to the next steps of the implementation.
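
To illustrate its use, here is a minimal calling sketch (the sample values are hypothetical, not taken from the benchmark code); arrays and stack-allocated buffers convert implicitly to spans:

ReadOnlySpan<int> left = stackalloc int[] { 1, 2, 3, 4 };
ReadOnlySpan<int> right = stackalloc int[] { 10, 20, 30, 40 };
Span<int> destination = stackalloc int[4];

For(left, right, destination); // destination now contains { 11, 22, 33, 44 }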

Bounds Checking Elimination

By default, C# implements bounds checking to ensure that accesses to spans remain within their boundaries, thereby enhancing code robustness at the expense of performance.
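
The JIT compiler can remove these checks when it can prove that the index never exceeds the span's length, typically when the loop bound is the indexed span's own Length property. In the For() method above the loop is bounded by left.Length, so only the accesses to right and destination still need to be checked. A minimal sketch of a pattern where the check is typically elided (an illustrative helper, not part of the article's code):

static void Increment(Span<int> values)
{
    // The loop bound is the span's own Length, so the JIT can prove the access is
    // always in range and typically emits no bounds check (observed behavior, not a guarantee).
    for (var index = 0; index < values.Length; index++)
        values[index]++;
}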

Using SharpLab, we can observe that the assembly code generated by the JIT compiler includes bounds checking to verify whether the index falls within the span’s bounds. If it exceeds the bounds, an exception is thrown.

L0000: push rsi
L0001: push rbx
L0002: sub rsp, 0x28
L0006: mov rax, [rcx]
L0009: mov ecx, [rcx+8]
L000c: xor r10d, r10d
L000f: test ecx, ecx
L0011: jle short L003c
L0013: cmp r10d, [r8+8]
L0017: jae short L0043 ;bounds check
L0019: mov r9, [r8]
L001c: mov r11d, r10d
L001f: mov ebx, [rax+r11*4]
L0023: cmp r10d, [rdx+8]
L0027: jae short L0043 ;bounds check
L0029: mov rsi, [rdx]
L002c: add ebx, [rsi+r11*4]
L0030: mov [r9+r11*4], ebx
L0034: inc r10d
L0037: cmp r10d, ecx
L003a: jl short L0013
L003c: add rsp, 0x28
L0040: pop rbx
L0041: pop rsi
L0042: ret
L0043: call 0x00007ff88e200da0 ;throws exception
L0048: int3

To manually eliminate bounds checking, we can use MemoryMarshal.GetReference() along with Unsafe.Add().

static void For_GetReference<T>(ReadOnlySpan<T> left, ReadOnlySpan<T> right, Span<T> destination)
    where T : struct, IAdditionOperators<T, T, T>
{
    ref var leftRef = ref MemoryMarshal.GetReference(left);
    ref var rightRef = ref MemoryMarshal.GetReference(right);
    ref var destinationRef = ref MemoryMarshal.GetReference(destination);

    var end = left.Length;
    for (var index = 0; index < end; index++)
        Unsafe.Add(ref destinationRef, index) = Unsafe.Add(ref leftRef, index) + Unsafe.Add(ref rightRef, index);
}

The method MemoryMarshal.GetReference() retrieves a reference to the start of the span. To access its elements, we use Unsafe.Add() with the index as the offset from the beginning.

Using SharpLab to examine the generated assembly code, we can see that it’s much simpler and contains no bounds checks.

L0000: mov rax, [rcx]
L0003: mov rdx, [rdx]
L0006: mov r8, [r8]
L0009: mov ecx, [rcx+8]
L000c: xor r10d, r10d
L000f: test ecx, ecx
L0011: jle short L0037
L0013: nop [rax+rax]
L0018: nop [rax+rax]
L0020: movsxd r9, r10d
L0023: mov r11d, [rax+r9*4]
L0027: add r11d, [rdx+r9*4]
L002b: mov [r8+r9*4], r11d
L002f: inc r10d
L0032: cmp r10d, ecx
L0035: jl short L0020
L0037: ret

The benchmarks below conclusively demonstrate the significantly improved efficiency of this code.

So far, parallelization hasn’t been employed. The primary aim was to establish an efficient baseline that we can enhance exclusively through parallelization techniques.

CPU Branching and Parallelization

In my previous article on CPU branching and parallelization, I discussed how modern CPUs use hardware-level mechanisms for automatic parallelization, leveraging available registers and execution units. By consolidating multiple operations within a single for loop iteration, the CPU can efficiently exploit these capabilities, resulting in enhanced performance and reduced overhead from logical branching. This approach significantly improves execution speed and resource utilization, especially in scenarios involving repetitive computations or data processing tasks.

static void For_GetReference_4<T>(ReadOnlySpan<T> left, ReadOnlySpan<T> right, Span<T> destination, int index = 0)
    where T : struct, IAdditionOperators<T, T, T>
{
    ref var leftRef = ref MemoryMarshal.GetReference(left);
    ref var rightRef = ref MemoryMarshal.GetReference(right);
    ref var destinationRef = ref MemoryMarshal.GetReference(destination);

    var end = left.Length - 3;
    for (; index < end; index += 4)
    {
        Unsafe.Add(ref destinationRef, index) = Unsafe.Add(ref leftRef, index) + Unsafe.Add(ref rightRef, index);
        Unsafe.Add(ref destinationRef, index + 1) = Unsafe.Add(ref leftRef, index + 1) + Unsafe.Add(ref rightRef, index + 1);
        Unsafe.Add(ref destinationRef, index + 2) = Unsafe.Add(ref leftRef, index + 2) + Unsafe.Add(ref rightRef, index + 2);
        Unsafe.Add(ref destinationRef, index + 3) = Unsafe.Add(ref leftRef, index + 3) + Unsafe.Add(ref rightRef, index + 3);
    }

    switch (left.Length - index)
    {
        case 3:
            Unsafe.Add(ref destinationRef, index) = Unsafe.Add(ref leftRef, index) + Unsafe.Add(ref rightRef, index);
            Unsafe.Add(ref destinationRef, index + 1) = Unsafe.Add(ref leftRef, index + 1) + Unsafe.Add(ref rightRef, index + 1);
            Unsafe.Add(ref destinationRef, index + 2) = Unsafe.Add(ref leftRef, index + 2) + Unsafe.Add(ref rightRef, index + 2);
            break;
        case 2:
            Unsafe.Add(ref destinationRef, index) = Unsafe.Add(ref leftRef, index) + Unsafe.Add(ref rightRef, index);
            Unsafe.Add(ref destinationRef, index + 1) = Unsafe.Add(ref leftRef, index + 1) + Unsafe.Add(ref rightRef, index + 1);
            break;
        case 1:
            Unsafe.Add(ref destinationRef, index) = Unsafe.Add(ref leftRef, index) + Unsafe.Add(ref rightRef, index);
            break;
    }
}

This method optimizes performance by processing four operations per loop iteration. Since the span’s length may not be a multiple of 4, the remaining operations are handled after the loop. To minimize logical branching, a switch statement is used to handle the cases where 3, 2, or 1 operations are remaining.
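For example, with the 111,111-element spans used in the benchmarks below, the unrolled loop performs 27,777 iterations covering 111,108 elements, and the switch handles the remaining 3.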

The index start value is passed in as a parameter as it will be useful for the next steps.

The benchmarks below conclusively demonstrate the significantly improved efficiency of this code.

SIMD

As detailed in my earlier article Single Instruction, Multiple Data (SIMD) in .NET, .NET furnishes developers with two distinct APIs for SIMD operations. I prefer using Vector<T>, found in the System.Numerics namespace, due to its simplicity, support for all native scalar types, and suitability for this particular scenario.
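
As a quick illustration of the API (independent of the method developed next), Vector<T>.Count reports how many elements fit in one hardware vector and the standard operators act on all of them at once:

using System.Numerics;

// On 256-bit SIMD hardware Vector<int>.Count is typically 8; on 128-bit SIMD it is 4.
Console.WriteLine(Vector<int>.Count);

var a = new Vector<int>(1); // every element set to 1
var b = new Vector<int>(2); // every element set to 2
var sum = a + b;            // a single SIMD addition; every element is now 3

With that in mind, the span addition can be implemented as follows: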

static void For_SIMD<T>(ReadOnlySpan<T> left, ReadOnlySpan<T> right, Span<T> destination)
    where T : struct, IAdditionOperators<T, T, T>
{
    var indexSource = 0;

    if (Vector.IsHardwareAccelerated && Vector<T>.IsSupported)
    {
        var leftVectors = MemoryMarshal.Cast<T, Vector<T>>(left);
        var rightVectors = MemoryMarshal.Cast<T, Vector<T>>(right);
        var destinationVectors = MemoryMarshal.Cast<T, Vector<T>>(destination);

        ref var leftVectorsRef = ref MemoryMarshal.GetReference(leftVectors);
        ref var rightVectorsRef = ref MemoryMarshal.GetReference(rightVectors);
        ref var destinationVectorsRef = ref MemoryMarshal.GetReference(destinationVectors);

        var endVectors = leftVectors.Length;
        for (var indexVector = 0; indexVector < endVectors; indexVector++)
        {
            Unsafe.Add(ref destinationVectorsRef, indexVector) = Unsafe.Add(ref leftVectorsRef, indexVector) + Unsafe.Add(ref rightVectorsRef, indexVector);
        }

        indexSource = leftVectors.Length * Vector<T>.Count;
    }

    For_GetReference_4(left, right, destination, indexSource);
}

The method verifies whether SIMD is supported by the hardware and whether the data type is compatible with this API. If both conditions are met, it employs the MemoryMarshal.Cast<TFrom, TTo>() method to convert the Span<T> into a Span<Vector<T>> without copying its contents. Operations performed on Vector<T> are automatically optimized to utilize SIMD capabilities.

Now, it merely needs to execute operations on the elements of the spans of vectors, leveraging MemoryMarshal.GetReference() and Unsafe.Add() to eliminate bounds checking.

Once more, there may be remaining elements that don’t align with a complete Vector<T>. In such cases, we can utilize the previously defined For_GetReference_4() method, passing the index of the first element to process.
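For instance, with 111,111 float elements and Vector<float>.Count equal to 8 (typical with 256-bit SIMD), the cast yields 13,888 vectors covering 111,104 elements, leaving 7 elements for For_GetReference_4() to process.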

The benchmarks below conclusively demonstrate the significantly improved efficiency of this code.

Tensors

As evident from the complexity of the code, implementing and maintaining such operations for each function can be burdensome. Thankfully, there are libraries available that streamline this process by utilizing value-type delegates.
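
The essence of the pattern is to pass the operation as a generic struct type parameter, so the JIT compiler can inline it, instead of passing a lambda. The following is a minimal sketch of the idea; the interface and type names are illustrative and do not match the actual API of either library:

// requires: using System.Numerics; (for IAdditionOperators<T, T, T>)

interface IBinaryOperator<T>
    where T : struct
{
    static abstract T Invoke(T x, T y);
}

readonly struct AddOperator<T> : IBinaryOperator<T>
    where T : struct, IAdditionOperators<T, T, T>
{
    public static T Invoke(T x, T y) => x + y;
}

static void Apply<T, TOperator>(ReadOnlySpan<T> left, ReadOnlySpan<T> right, Span<T> destination)
    where T : struct
    where TOperator : struct, IBinaryOperator<T>
{
    // The bounds-check elimination, unrolling and SIMD machinery shown above would live
    // here, written once for all operations (the real libraries also require an Invoke
    // overload operating on Vector<T>). The plain loop below keeps the sketch short.
    for (var index = 0; index < left.Length; index++)
        destination[index] = TOperator.Invoke(left[index], right[index]);
}

An addition over spans then becomes Apply<int, AddOperator<int>>(left, right, destination), and any other element-wise operation only needs its own small operator struct.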

System.Numerics.Tensors is a NuGet package developed and maintained by the .NET team. It implements many span operations using the patterns discussed above. For example, addition operations can be invoked as follows:

TensorPrimitives.Add<int>(sourceInt32, otherInt32, resultInt32);

NetFabric.Numerics.Tensors, an alternative developed by me, arose from the initial limitations of System.Numerics.Tensors regarding the range of supported native scalar types. While that library is set to support these types in future releases, NetFabric.Numerics.Tensors currently offers broader support and facilitates the development of custom operations.

Similarly, it supports addition operations, which can be called as follows:

TensorOperations.Add<int>(sourceInt32!, otherInt32!, resultInt32!);

Multi-core CPUs

Most modern CPUs provide multiple cores, yet the code implemented so far uses only one of them. The operation we are implementing is highly parallelizable, as there are no dependencies between elements: the spans can be sliced and each slice processed by a different core.

Parallel.For is commonly used in .NET to parallelize operations.

static void Parallel_For<T>(T[] left, T[] right, T[] destination)
    where T : struct, IAdditionOperators<T, T, T>
{
    Parallel.For(0, left.Length, index => destination[index] = left[index] + right[index]);
}

The issue here is that the lambda expression passed as a parameter is applied to each element individually, so SIMD cannot be used. An alternative approach is to apply the lambda expression to slices of the spans, which requires a different strategy.

static void Parallel_Invoke_System_Numerics_Tensors<T>(ReadOnlyMemory<T> left, ReadOnlyMemory<T> right, Memory<T> destination)
    where T : struct, IAdditionOperators<T, T, T>
{
    const int minChunkCount = 4;
    const int minChunkSize = 1_000;

    var coreCount = Environment.ProcessorCount;

    if (coreCount >= minChunkCount && left.Length > minChunkCount * minChunkSize)
        ParallelApply(left, right, destination, coreCount);
    else
        TensorPrimitives.Add(left.Span, right.Span, destination.Span);

    static void ParallelApply(ReadOnlyMemory<T> left, ReadOnlyMemory<T> right, Memory<T> destination, int coreCount)
    {
        var totalSize = left.Length;
        var chunkSize = int.Max(totalSize / coreCount, minChunkSize);

        var actions = GC.AllocateArray<Action>(totalSize / chunkSize);
        var start = 0;
        for (var index = 0; index < actions.Length; index++)
        {
            var length = (index == actions.Length - 1)
                ? totalSize - start
                : chunkSize;

            var leftSlice = left.Slice(start, length);
            var rightSlice = right.Slice(start, length);
            var destinationSlice = destination.Slice(start, length);
            actions[index] = () => TensorPrimitives.Add(leftSlice.Span, rightSlice.Span, destinationSlice.Span);

            start += length;
        }
        Parallel.Invoke(actions);
    }
}

Multi-threading can significantly boost performance when processing large datasets. However, the overhead of managing threads may outweigh the benefits for smaller datasets. To address this, the code introduces two constants: minChunkSize, which sets the minimum data size per thread, and minChunkCount, determining the minimum number of threads to use.
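For instance, with the 111,111-element spans used in the benchmarks below and Environment.ProcessorCount reporting 16 (as on the benchmark machine), chunkSize becomes int.Max(111,111 / 16, 1,000) = 6,944, so 111,111 / 6,944 = 16 actions are created and the last one processes the remaining 6,951 elements.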

Please note that the method parameters are now of type Memory<T> instead of Span<T>. This is because Span<T> is a ref struct, a value type that can only live on the stack and can neither be boxed nor captured by a lambda expression. The closures created for the actions can only capture types that may be stored on the heap; Memory<T> is a regular struct that satisfies this requirement and can be converted back to spans inside each action.
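
The following sketch illustrates the constraint (an illustrative example, not code from the article):

static void Example(Memory<int> buffer)
{
    // Span<int> span = buffer.Span;
    // Action broken = () => span[0] = 1;   // does not compile: a ref struct cannot be captured by a lambda

    Action works = () => buffer.Span[0] = 1; // compiles: Memory<int> is a regular struct and can be captured
    works();
}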

The internal method ParallelApply() orchestrates the threading logic. It’s invoked only if the CPU core count equals or exceeds the minimum thread count and if the span length contains sufficient data to fill the minimum number of threads with the minimum number of elements. Otherwise, the single-threaded TensorPrimitives.Add() method provided by System.Numerics.Tensors is used.

Internally, ParallelApply() leverages Parallel.Invoke() to execute actions concurrently. It generates a maximum number of actions equal to the CPU core count, slicing the spans so that each action handles a distinct portion. However, it may generate fewer actions than cores if there isn’t enough data to warrant additional threads. Each action then calls the TensorPrimitives.Add() method on its assigned slice.

Our latest implementation fully leverages modern CPU parallelization techniques. This integration optimizes performance by employing SIMD operations, threading, and efficient data processing strategies. This approach accelerates processing speed and resource use, making it ideal for handling demanding computational tasks.

Alternatively, you can use the TensorOperations.Add() method from the NetFabric.Numerics.Tensors library. I am actively enhancing this library to seamlessly integrate multi-threading support, simplifying the utilization of multi-threading for any span operation.

Benchmarks

By leveraging BenchmarkDotNet, we precisely quantify the performance enhancements achieved at each implementation stage. You can access the benchmarks here to review the complete code and execute them on your own system to validate the results.
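
For reference, the benchmarks follow the usual BenchmarkDotNet shape. The class below is a trimmed-down sketch, not the actual benchmark class from the repository; it assumes the methods shown earlier are accessible from the benchmark class and ignores the per-job SIMD configuration:

using System.Linq;
using BenchmarkDotNet.Attributes;

public class AddBenchmarks
{
    int[] left = [], right = [], destination = [];

    [Params(111_111)]
    public int Count { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        left = Enumerable.Range(0, Count).ToArray();
        right = Enumerable.Range(0, Count).ToArray();
        destination = new int[Count];
    }

    [Benchmark(Baseline = true)]
    public void Int32_For() => For<int>(left, right, destination);

    [Benchmark]
    public void Int32_For_SIMD() => For_SIMD<int>(left, right, destination);
}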

Below are the benchmark results obtained on the following system configuration:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]    : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Scalar    : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT
  Vector128 : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX
  Vector256 : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX2
  Vector512 : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

OutlierMode=DontRemove  MemoryRandomization=True

The benchmarks were conducted for both int and float types, using spans containing 111,111 elements. This is large enough to make multi-threading useful, and it is an odd number so that all code paths are executed.

Additionally, benchmarks were performed under various SIMD availability scenarios: Scalar (SIMD unavailable), 128-bit SIMD, 256-bit SIMD, and 512-bit SIMD.

Explore my other article, “Unit Testing and Benchmarking SIMD in .NET”, for deeper insights into how this process functions.

| Method | Job | Categories | Count | Mean | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|--------|-----|------------|-------|------|--------|--------|-------|------|-----------|-------------|
| Int32_For | Scalar | Int32 | 111111 | 41.420 μs | 2.0507 μs | 40.827 μs | baseline | - | - | NA |
| Int32_For_GetReference | Scalar | Int32 | 111111 | 45.204 μs | 1.5323 μs | 44.780 μs | 1.09x slower | - | - | NA |
| Int32_For_GetReference4 | Scalar | Int32 | 111111 | 31.513 μs | 0.9492 μs | 31.354 μs | 1.33x faster | - | - | NA |
| Int32_For_SIMD | Scalar | Int32 | 111111 | 31.612 μs | 0.9873 μs | 31.481 μs | 1.33x faster | - | - | NA |
| Int32_System_Numerics_Tensor | Scalar | Int32 | 111111 | 32.421 μs | 0.5159 μs | 32.528 μs | 1.29x faster | - | - | NA |
| Int32_Parallel_For | Scalar | Int32 | 111111 | 46.724 μs | 3.6772 μs | 46.217 μs | 1.14x slower | 0.5493 | 4880 B | NA |
| Int32_Parallel_Invoke | Scalar | Int32 | 111111 | 10.456 μs | 0.2746 μs | 10.516 μs | 4.03x faster | 0.5951 | 4994 B | NA |
| Int32_Parallel_Invoke_SIMD | Scalar | Int32 | 111111 | 10.341 μs | 0.2540 μs | 10.344 μs | 4.03x faster | 0.5951 | 5045 B | NA |
| Int32_For | Vector128 | Int32 | 111111 | 39.906 μs | 0.3442 μs | 39.849 μs | 1.05x faster | - | - | NA |
| Int32_For_GetReference | Vector128 | Int32 | 111111 | 44.182 μs | 0.3725 μs | 44.229 μs | 1.06x slower | - | - | NA |
| Int32_For_GetReference4 | Vector128 | Int32 | 111111 | 30.396 μs | 0.3119 μs | 30.383 μs | 1.38x faster | - | - | NA |
| Int32_For_SIMD | Vector128 | Int32 | 111111 | 12.678 μs | 1.0770 μs | 12.541 μs | 3.29x faster | - | - | NA |
| Int32_System_Numerics_Tensor | Vector128 | Int32 | 111111 | 11.937 μs | 2.9772 μs | 11.275 μs | 3.61x faster | - | - | NA |
| Int32_Parallel_For | Vector128 | Int32 | 111111 | 49.473 μs | 1.6099 μs | 49.377 μs | 1.19x slower | 0.5493 | 4826 B | NA |
| Int32_Parallel_Invoke | Vector128 | Int32 | 111111 | 10.831 μs | 0.1759 μs | 10.805 μs | 3.87x faster | 0.5951 | 4992 B | NA |
| Int32_Parallel_Invoke_SIMD | Vector128 | Int32 | 111111 | 6.214 μs | 0.0856 μs | 6.226 μs | 6.75x faster | 0.5417 | 4536 B | NA |
| Int32_For | Vector256 | Int32 | 111111 | 41.040 μs | 1.0313 μs | 40.985 μs | 1.02x faster | - | - | NA |
| Int32_For_GetReference | Vector256 | Int32 | 111111 | 44.586 μs | 0.6762 μs | 44.652 μs | 1.07x slower | - | - | NA |
| Int32_For_GetReference4 | Vector256 | Int32 | 111111 | 31.356 μs | 0.5204 μs | 31.378 μs | 1.34x faster | - | - | NA |
| Int32_For_SIMD | Vector256 | Int32 | 111111 | 10.755 μs | 0.2100 μs | 10.739 μs | 3.89x faster | - | - | NA |
| Int32_System_Numerics_Tensor | Vector256 | Int32 | 111111 | 11.326 μs | 1.6550 μs | 11.022 μs | 3.77x faster | - | - | NA |
| Int32_Parallel_For | Vector256 | Int32 | 111111 | 41.164 μs | 0.7934 μs | 40.924 μs | 1.02x faster | 0.5493 | 4942 B | NA |
| Int32_Parallel_Invoke | Vector256 | Int32 | 111111 | 10.103 μs | 0.2397 μs | 10.139 μs | 4.13x faster | 0.5951 | 5021 B | NA |
| Int32_Parallel_Invoke_SIMD | Vector256 | Int32 | 111111 | 5.587 μs | 0.1010 μs | 5.559 μs | 7.51x faster | 0.5417 | 4506 B | NA |
| Int32_For | Vector512 | Int32 | 111111 | 39.403 μs | 0.7315 μs | 39.303 μs | 1.06x faster | - | - | NA |
| Int32_For_GetReference | Vector512 | Int32 | 111111 | 43.781 μs | 0.6366 μs | 43.905 μs | 1.05x slower | - | - | NA |
| Int32_For_GetReference4 | Vector512 | Int32 | 111111 | 31.453 μs | 0.9483 μs | 31.375 μs | 1.33x faster | - | - | NA |
| Int32_For_SIMD | Vector512 | Int32 | 111111 | 10.582 μs | 0.3608 μs | 10.456 μs | 3.95x faster | - | - | NA |
| Int32_System_Numerics_Tensor | Vector512 | Int32 | 111111 | 11.351 μs | 0.7147 μs | 11.324 μs | 3.68x faster | - | - | NA |
| Int32_Parallel_For | Vector512 | Int32 | 111111 | 41.070 μs | 0.8726 μs | 41.126 μs | 1.02x faster | 0.5493 | 4915 B | NA |
| Int32_Parallel_Invoke | Vector512 | Int32 | 111111 | 10.042 μs | 0.1468 μs | 9.995 μs | 4.17x faster | 0.5951 | 5019 B | NA |
| Int32_Parallel_Invoke_SIMD | Vector512 | Int32 | 111111 | 5.604 μs | 0.1048 μs | 5.620 μs | 7.47x faster | 0.5417 | 4500 B | NA |
| Single_For | Scalar | Single | 111111 | 34.365 μs | 0.5454 μs | 34.402 μs | baseline | - | - | NA |
| Single_For_GetReference | Scalar | Single | 111111 | 25.741 μs | 0.4834 μs | 25.673 μs | 1.34x faster | - | - | NA |
| Single_For_GetReference4 | Scalar | Single | 111111 | 24.754 μs | 0.2640 μs | 24.759 μs | 1.39x faster | - | - | NA |
| Single_For_SIMD | Scalar | Single | 111111 | 24.919 μs | 0.2662 μs | 24.873 μs | 1.38x faster | - | - | NA |
| Single_System_Numerics_Tensor | Scalar | Single | 111111 | 24.337 μs | 0.3864 μs | 24.388 μs | 1.41x faster | - | - | NA |
| Single_Parallel_For | Scalar | Single | 111111 | 40.225 μs | 0.9705 μs | 40.129 μs | 1.17x slower | 0.5493 | 4895 B | NA |
| Single_Parallel_Invoke | Scalar | Single | 111111 | 8.873 μs | 0.1662 μs | 8.839 μs | 3.87x faster | 0.5798 | 4920 B | NA |
| Single_Parallel_Invoke_SIMD | Scalar | Single | 111111 | 8.878 μs | 0.1184 μs | 8.858 μs | 3.87x faster | 0.5951 | 4946 B | NA |
| Single_For | Vector128 | Single | 111111 | 34.610 μs | 0.8964 μs | 34.502 μs | 1.01x slower | - | - | NA |
| Single_For_GetReference | Vector128 | Single | 111111 | 25.786 μs | 0.3613 μs | 25.759 μs | 1.33x faster | - | - | NA |
| Single_For_GetReference4 | Vector128 | Single | 111111 | 24.948 μs | 0.4087 μs | 25.093 μs | 1.38x faster | - | - | NA |
| Single_For_SIMD | Vector128 | Single | 111111 | 11.545 μs | 0.3195 μs | 11.408 μs | 3.00x faster | - | - | NA |
| Single_System_Numerics_Tensor | Vector128 | Single | 111111 | 11.834 μs | 1.4373 μs | 11.189 μs | 3.00x faster | - | - | NA |
| Single_Parallel_For | Vector128 | Single | 111111 | 41.137 μs | 0.7078 μs | 41.019 μs | 1.20x slower | 0.5493 | 4945 B | NA |
| Single_Parallel_Invoke | Vector128 | Single | 111111 | 8.801 μs | 0.1852 μs | 8.760 μs | 3.90x faster | 0.5951 | 4959 B | NA |
| Single_Parallel_Invoke_SIMD | Vector128 | Single | 111111 | 5.759 μs | 0.1627 μs | 5.740 μs | 5.96x faster | 0.5493 | 4580 B | NA |
| Single_For | Vector256 | Single | 111111 | 34.603 μs | 0.4754 μs | 34.639 μs | 1.01x slower | - | - | NA |
| Single_For_GetReference | Vector256 | Single | 111111 | 25.683 μs | 0.3307 μs | 25.708 μs | 1.34x faster | - | - | NA |
| Single_For_GetReference4 | Vector256 | Single | 111111 | 24.840 μs | 0.3187 μs | 24.710 μs | 1.38x faster | - | - | NA |
| Single_For_SIMD | Vector256 | Single | 111111 | 10.808 μs | 0.2226 μs | 10.799 μs | 3.18x faster | - | - | NA |
| Single_System_Numerics_Tensor | Vector256 | Single | 111111 | 11.331 μs | 0.8926 μs | 11.241 μs | 3.03x faster | - | - | NA |
| Single_Parallel_For | Vector256 | Single | 111111 | 40.993 μs | 1.8325 μs | 40.735 μs | 1.18x slower | 0.5493 | 4940 B | NA |
| Single_Parallel_Invoke | Vector256 | Single | 111111 | 8.816 μs | 0.1297 μs | 8.799 μs | 3.90x faster | 0.5951 | 4960 B | NA |
| Single_Parallel_Invoke_SIMD | Vector256 | Single | 111111 | 5.579 μs | 0.1175 μs | 5.567 μs | 6.15x faster | 0.5417 | 4526 B | NA |
| Single_For | Vector512 | Single | 111111 | 34.520 μs | 0.5822 μs | 34.358 μs | 1.00x slower | - | - | NA |
| Single_For_GetReference | Vector512 | Single | 111111 | 25.769 μs | 0.3657 μs | 25.700 μs | 1.33x faster | - | - | NA |
| Single_For_GetReference4 | Vector512 | Single | 111111 | 24.979 μs | 0.2496 μs | 24.966 μs | 1.38x faster | - | - | NA |
| Single_For_SIMD | Vector512 | Single | 111111 | 10.849 μs | 0.2199 μs | 10.853 μs | 3.17x faster | - | - | NA |
| Single_System_Numerics_Tensor | Vector512 | Single | 111111 | 11.599 μs | 0.3649 μs | 11.631 μs | 2.95x faster | - | - | NA |
| Single_Parallel_For | Vector512 | Single | 111111 | 41.344 μs | 1.2447 μs | 40.946 μs | 1.20x slower | 0.5493 | 4932 B | NA |
| Single_Parallel_Invoke | Vector512 | Single | 111111 | 8.909 μs | 0.1620 μs | 8.879 μs | 3.86x faster | 0.5951 | 4949 B | NA |
| Single_Parallel_Invoke_SIMD | Vector512 | Single | 111111 | 5.552 μs | 0.1235 μs | 5.537 μs | 6.19x faster | 0.5417 | 4531 B | NA |

Conclusions

The benchmarks demonstrate an improvement at each stage of the implementation. While parallelization on the CPU requires more complex code, it results in enhanced efficiency.

The complexity can be reduced by using libraries like System.Numerics.Tensors or NetFabric.Numerics.Tensors, which make use of value-type delegates so that code similar to that shown in this article can be reused.

Despite the large number of CPU cores used, the performance gains from multi-threading appear limited, and I do not yet understand why.

Feedback and suggestions for improving the article’s content are appreciated.
