
Unit testing and benchmarking SIMD in .NET

Single Instruction, Multiple Data (SIMD) is a computer architecture technique that allows a single instruction to operate on multiple data elements simultaneously. I’ve touched on the advantages of SIMD and its implementation in .NET in a previous article.

The performance of SIMD depends on the hardware the application runs on. Except for a few specific scenarios, developers must ensure that their applications work in any situation: on systems that don’t support SIMD at all, as well as on systems that support 128-bit, 256-bit, or even 512-bit SIMD. Consequently, to achieve comprehensive coverage during unit testing, it’s crucial to run the tests against all of these configurations. It’s also worth running performance tests on them to check whether the application uses the system’s capabilities effectively. Even if you don’t use SIMD directly, your dependencies may make use of it.

However, this doesn’t mean these tests have to run on separate physical systems. The hardware features can be disabled in software through environment variables. This article focuses on identifying which variables to modify, and how to do so during local unit testing, continuous integration pipeline testing, and performance testing.

Environment Variables

.NET provides the ability to disable hardware features using the environment variables listed below:

  • DOTNET_EnableHWIntrinsic - Toggles SIMD support.
  • DOTNET_EnableAVX2 - Toggles 256-bit SIMD support.
  • DOTNET_EnableAVX512F - Toggles 512-bit SIMD support.

You can enable these features by setting the variables to 1 and disable them by setting them to 0. By default, they are set to 1.

The following settings correspond to the given scenarios:

  • No SIMD:
    • DOTNET_EnableHWIntrinsic - 0
  • Support 128-bit SIMD:
    • DOTNET_EnableHWIntrinsic - 1
    • DOTNET_EnableAVX2 - 0
    • DOTNET_EnableAVX512F - 0
  • Support up to 256-bit SIMD:
    • DOTNET_EnableHWIntrinsic - 1
    • DOTNET_EnableAVX2 - 1
    • DOTNET_EnableAVX512F - 0
  • Support up to 512-bit SIMD:
    • DOTNET_EnableHWIntrinsic - 1
    • DOTNET_EnableAVX2 - 1
    • DOTNET_EnableAVX512F - 1

Keep in mind that these variables can only disable features. If you aim to test the application with 512-bit SIMD, the hardware must support it; the variables can only force the application down to a narrower SIMD width or turn SIMD off entirely.

Also note that Vector<T> always uses the maximum width available, while Vector128<T>, Vector256<T>, and Vector512<T> report hardware acceleration individually, depending on which widths are enabled.
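A quick way to confirm that these variables are taking effect is to print the SIMD support that the runtime reports at startup. Here’s a minimal sketch, assuming a .NET 8 console application with top-level statements:

using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Reports the SIMD support visible to the current process.
// Launch it with different environment variable combinations to see the effect.
Console.WriteLine($"Vector<T> width:       {Vector<byte>.Count * 8}-bit");
Console.WriteLine($"Vector128 accelerated: {Vector128.IsHardwareAccelerated}");
Console.WriteLine($"Vector256 accelerated: {Vector256.IsHardwareAccelerated}");
Console.WriteLine($"Vector512 accelerated: {Vector512.IsHardwareAccelerated}");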

Unit Testing

There are various methods to execute unit tests in .NET. In this section, I will discuss the ones that I have used and successfully configured to handle multiple scenarios.
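Regardless of how the tests are run, the tests themselves stay the same; a typical one compares a vectorized implementation against a plain scalar loop, so that every SIMD configuration can be verified to produce the same result. Here’s a minimal xUnit sketch, where MyTensor.Sum stands in for whatever SIMD-accelerated method you want to cover:

using System.Linq;
using Xunit;

public class SumTests
{
    [Theory]
    [InlineData(1)]    // smaller than any vector
    [InlineData(7)]    // forces a scalar remainder
    [InlineData(100)]  // spans several vectors
    public void Sum_MatchesScalarReference(int count)
    {
        var source = Enumerable.Range(0, count).Select(value => (float)value).ToArray();

        // Scalar reference result computed with a plain loop.
        var expected = 0f;
        foreach (var value in source)
            expected += value;

        // MyTensor.Sum is a hypothetical SIMD-accelerated method under test.
        var actual = MyTensor.Sum(source);

        // Exact equality is fine here because the values are small integers.
        Assert.Equal(expected, actual);
    }
}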

Utilizing a .runsettings File to Configure Unit Tests

A .runsettings file can be employed to control how unit tests are run. This allows for the adjustment of various settings, including environment variables.

You can create multiple .runsettings files for different configurations. I typically generate the following files and place them within the repository folder structure:

_Scalar.runsettings

<?xml version="1.0" encoding="utf-8"?>
<RunSettings>
    <RunConfiguration>
        <EnvironmentVariables>
            <DOTNET_EnableHWIntrinsic>0</DOTNET_EnableHWIntrinsic>
        </EnvironmentVariables>
    </RunConfiguration>
</RunSettings>

_Vector128.runsettings

<?xml version="1.0" encoding="utf-8"?>
<RunSettings>
    <RunConfiguration>
        <EnvironmentVariables>
            <DOTNET_EnableHWIntrinsic>1</DOTNET_EnableHWIntrinsic>
            <DOTNET_EnableAVX2>0</DOTNET_EnableAVX2>
            <DOTNET_EnableAVX512F>0</DOTNET_EnableAVX512F>
        </EnvironmentVariables>
    </RunConfiguration>
</RunSettings>

_Vector256.runsettings

<?xml version="1.0" encoding="utf-8"?>
<RunSettings>
    <RunConfiguration>
        <EnvironmentVariables>
            <DOTNET_EnableHWIntrinsic>1</DOTNET_EnableHWIntrinsic>
            <DOTNET_EnableAVX512F>0</DOTNET_EnableAVX512F>
        </EnvironmentVariables>
    </RunConfiguration>
</RunSettings>

_Vector512.runsettings

<?xml version="1.0" encoding="utf-8"?>
<RunSettings>
    <RunConfiguration>
        <EnvironmentVariables>
            <DOTNET_EnableHWIntrinsic>1</DOTNET_EnableHWIntrinsic>
        </EnvironmentVariables>
    </RunConfiguration>
</RunSettings>

I prefix the file names with an underscore (_) just to make them easier to locate in a list of files.

In the following sections, I will explain how to utilize these files in various development environments.

Visual Studio

Visual Studio has the capability to automatically detect a .runsettings file and utilize it when executing unit tests. However, in instances where multiple .runsettings files are present, manual selection of the desired file is required.

To do this in the IDE, navigate to Test > Configure Run Settings > Select Solution Wide runsettings File, and then choose the .runsettings file with the settings you wish to apply.

You can now execute the unit tests using the settings from the selected file.

Command Line

You can run the unit tests from the command line by using the dotnet test command. This comes in handy when working with Visual Studio Code or when configuring continuous integration pipelines.

The dotnet test command permits the configuration of environment variables using the --environment option or its shorter form -e. Here’s an example of how it can be used to support 128-bit SIMD:

dotnet test -e:DOTNET_EnableAVX2=0 -e:DOTNET_EnableAVX512F=0

The dotnet test command also accommodates the use of .runsettings files with the --settings option or its abbreviated form -s. Here’s the equivalent example using a settings file:

dotnet test -s:_Vector128.runsettings

Performance Testing

For .NET performance testing, I consistently utilize BenchmarkDotNet. It’s a tool that not only provides precise results but is also user-friendly.

BenchmarkDotNet introduces the concept of jobs, which allows you to run the same benchmarks under varying conditions and compare their performance. While there are numerous ways to configure these jobs, my preferred method is to define a configuration class:

using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Reports;
using System.Runtime.Intrinsics;

class VectorizationConfig : ManualConfig
{
    public VectorizationConfig()
    {
        _ = WithSummaryStyle(SummaryStyle.Default.WithRatioStyle(RatioStyle.Trend));
        _ = HideColumns(Column.EnvironmentVariables, Column.RatioSD, Column.Error);
        _ = AddJob(Job.Default.WithId("Scalar")
            .WithEnvironmentVariable("DOTNET_EnableHWIntrinsic", "0")
            .AsBaseline());
        if (Vector128.IsHardwareAccelerated)
        {
            _ = AddJob(Job.Default.WithId("Vector128")
                    .WithEnvironmentVariable("DOTNET_EnableAVX2", "0")
                    .WithEnvironmentVariable("DOTNET_EnableAVX512F", "0"));
        }
        if (Vector256.IsHardwareAccelerated)
        {
            _ = AddJob(Job.Default.WithId("Vector256")
                .WithEnvironmentVariable("DOTNET_EnableAVX512F", "0"));
        }
        if (Vector512.IsHardwareAccelerated)
        {
            _ = AddJob(Job.Default.WithId("Vector512"));
        }
    }
}

Note: If you’re curious about why I use discards in this code, you might find my other article “Defensive Coding in C#: A Closer Look at Unchecked Return Value Discards” interesting.

This class adds a job named Scalar that executes the benchmarks with SIMD disabled and sets it as the baseline. If the hardware supports 128-bit SIMD, it adds a job named Vector128, and similarly for 256-bit and 512-bit SIMD.

You can then use this class in conjunction with the ConfigAttribute. Simply add [Config(typeof(VectorizationConfig))] to your benchmarking classes. The advantage of this approach is that you can apply this attribute to only those benchmarks that you wish to test under multiple SIMD scenarios.

For example:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using System.Numerics.Tensors;

namespace NetFabric.Numerics.Tensors.Benchmarks;

[Config(typeof(VectorizationConfig))]
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)]
[CategoriesColumn]
public class SumBenchmarks
{
    int[]? arrayInt;
    float[]? arrayFloat;

    [Params(5, 100)]
    public int Count { get; set; }

    [GlobalSetup]
    public void GlobalSetup()
    {
        arrayInt = new int[Count];
        arrayFloat = new float[Count];

        var random = new Random(42);
        for(var index = 0; index < Count; index++)
        {
            arrayInt[index] = random.Next(10);
            arrayFloat[index] = random.Next(10);
        }
    }

    [BenchmarkCategory("Int")]
    [Benchmark(Baseline = true)]
    public int Baseline_Int()
        => Baseline.Sum<int>(arrayInt!);

    [BenchmarkCategory("Int")]
    [Benchmark]
    public int LINQ_Int()
        => Enumerable.Sum(arrayInt!);

    [BenchmarkCategory("Int")]
    [Benchmark]
    public int System_Int()
        => TensorPrimitives.Sum<int>(arrayInt!);

    [BenchmarkCategory("Int")]
    [Benchmark]
    public int NetFabric_Int()
        => TensorOperations.Sum<int>(arrayInt!);

    [BenchmarkCategory("Float")]
    [Benchmark(Baseline = true)]
    public float Baseline_Float()
        => Baseline.Sum<float>(arrayFloat!);

    [BenchmarkCategory("Float")]
    [Benchmark]
    public float System_Float()
        => TensorPrimitives.Sum(arrayFloat!);

    [BenchmarkCategory("Float")]
    [Benchmark]
    public float NetFabric_Float()
        => TensorOperations.Sum<float>(arrayFloat!);
}
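One way to run it is a console project with an entry point like the one below. BenchmarkDotNet spawns a separate process for each job, which is why the per-job environment variables take effect:

using BenchmarkDotNet.Running;
using NetFabric.Numerics.Tensors.Benchmarks;

// Each job defined in VectorizationConfig runs in its own process,
// so its environment variables only affect that job's benchmarks.
BenchmarkRunner.Run<SumBenchmarks>();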

Running the benchmarks outputs the following:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3007/23H2/2023Update/SunValley3)
AMD Ryzen 9 7940HS w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.101
  [Host]    : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Scalar    : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT
  Vector128 : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX
  Vector256 : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
  Vector512 : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method          | Job       | Categories | Count |      Mean |    StdDev |        Ratio |
|---------------- |---------- |----------- |------ |----------:|----------:|-------------:|
| Baseline_Float  | Scalar    | Float      |     5 |  1.856 ns | 0.0284 ns |     baseline |
| LINQ_Float      | Scalar    | Float      |     5 |  2.765 ns | 0.0696 ns | 1.49x slower |
| System_Float    | Scalar    | Float      |     5 |  1.820 ns | 0.0235 ns | 1.02x faster |
| NetFabric_Float | Scalar    | Float      |     5 |  1.734 ns | 0.0292 ns | 1.07x faster |
| Baseline_Float  | Vector128 | Float      |     5 |  1.810 ns | 0.0304 ns | 1.03x faster |
| LINQ_Float      | Vector128 | Float      |     5 |  2.745 ns | 0.0475 ns | 1.48x slower |
| System_Float    | Vector128 | Float      |     5 |  3.286 ns | 0.0605 ns | 1.77x slower |
| NetFabric_Float | Vector128 | Float      |     5 |  1.908 ns | 0.0242 ns | 1.03x slower |
| Baseline_Float  | Vector256 | Float      |     5 |  1.790 ns | 0.0287 ns | 1.04x faster |
| LINQ_Float      | Vector256 | Float      |     5 |  2.661 ns | 0.0502 ns | 1.43x slower |
| System_Float    | Vector256 | Float      |     5 |  2.318 ns | 0.0145 ns | 1.25x slower |
| NetFabric_Float | Vector256 | Float      |     5 |  1.921 ns | 0.0251 ns | 1.04x slower |
| Baseline_Float  | Vector512 | Float      |     5 |  1.820 ns | 0.0176 ns | 1.02x faster |
| LINQ_Float      | Vector512 | Float      |     5 |  2.759 ns | 0.0537 ns | 1.49x slower |
| System_Float    | Vector512 | Float      |     5 |  2.313 ns | 0.0381 ns | 1.25x slower |
| NetFabric_Float | Vector512 | Float      |     5 |  1.896 ns | 0.0266 ns | 1.02x slower |
|                 |           |            |       |           |           |              |
| Baseline_Float  | Scalar    | Float      |   100 | 38.293 ns | 0.2998 ns |     baseline |
| LINQ_Float      | Scalar    | Float      |   100 | 59.372 ns | 0.8637 ns | 1.55x slower |
| System_Float    | Scalar    | Float      |   100 | 38.491 ns | 0.5265 ns | 1.01x slower |
| NetFabric_Float | Scalar    | Float      |   100 | 12.650 ns | 0.1233 ns | 3.03x faster |
| Baseline_Float  | Vector128 | Float      |   100 | 38.962 ns | 0.3469 ns | 1.02x slower |
| LINQ_Float      | Vector128 | Float      |   100 | 58.546 ns | 0.6409 ns | 1.53x slower |
| System_Float    | Vector128 | Float      |   100 |  6.212 ns | 0.1246 ns | 6.17x faster |
| NetFabric_Float | Vector128 | Float      |   100 |  7.835 ns | 0.1275 ns | 4.89x faster |
| Baseline_Float  | Vector256 | Float      |   100 | 38.674 ns | 0.4669 ns | 1.01x slower |
| LINQ_Float      | Vector256 | Float      |   100 | 58.555 ns | 0.7268 ns | 1.53x slower |
| System_Float    | Vector256 | Float      |   100 |  4.528 ns | 0.0678 ns | 8.46x faster |
| NetFabric_Float | Vector256 | Float      |   100 |  5.540 ns | 0.0867 ns | 6.91x faster |
| Baseline_Float  | Vector512 | Float      |   100 | 38.833 ns | 0.6120 ns | 1.01x slower |
| LINQ_Float      | Vector512 | Float      |   100 | 59.147 ns | 0.8935 ns | 1.54x slower |
| System_Float    | Vector512 | Float      |   100 |  5.140 ns | 0.0444 ns | 7.44x faster |
| NetFabric_Float | Vector512 | Float      |   100 |  5.351 ns | 0.0591 ns | 7.16x faster |
|                 |           |            |       |           |           |              |
| Baseline_Int    | Scalar    | Int        |     5 |  1.837 ns | 0.0382 ns |     baseline |
| LINQ_Int        | Scalar    | Int        |     5 |  1.858 ns | 0.0413 ns | 1.01x slower |
| System_Int      | Scalar    | Int        |     5 |  1.617 ns | 0.0285 ns | 1.14x faster |
| NetFabric_Int   | Scalar    | Int        |     5 |  1.906 ns | 0.0245 ns | 1.04x slower |
| Baseline_Int    | Vector128 | Int        |     5 |  1.811 ns | 0.0337 ns | 1.01x faster |
| LINQ_Int        | Vector128 | Int        |     5 |  1.819 ns | 0.0339 ns | 1.01x faster |
| System_Int      | Vector128 | Int        |     5 |  3.489 ns | 0.0671 ns | 1.90x slower |
| NetFabric_Int   | Vector128 | Int        |     5 |  2.133 ns | 0.0365 ns | 1.16x slower |
| Baseline_Int    | Vector256 | Int        |     5 |  1.822 ns | 0.0278 ns | 1.01x faster |
| LINQ_Int        | Vector256 | Int        |     5 |  1.803 ns | 0.0343 ns | 1.02x faster |
| System_Int      | Vector256 | Int        |     5 |  2.323 ns | 0.0208 ns | 1.27x slower |
| NetFabric_Int   | Vector256 | Int        |     5 |  2.510 ns | 0.0252 ns | 1.37x slower |
| Baseline_Int    | Vector512 | Int        |     5 |  1.826 ns | 0.0226 ns | 1.01x faster |
| LINQ_Int        | Vector512 | Int        |     5 |  2.021 ns | 0.0268 ns | 1.10x slower |
| System_Int      | Vector512 | Int        |     5 |  2.111 ns | 0.0266 ns | 1.15x slower |
| NetFabric_Int   | Vector512 | Int        |     5 |  2.510 ns | 0.0232 ns | 1.37x slower |
|                 |           |            |       |           |           |              |
| Baseline_Int    | Scalar    | Int        |   100 | 27.237 ns | 0.2286 ns |     baseline |
| LINQ_Int        | Scalar    | Int        |   100 | 28.539 ns | 0.5145 ns | 1.05x slower |
| System_Int      | Scalar    | Int        |   100 | 28.771 ns | 0.4149 ns | 1.06x slower |
| NetFabric_Int   | Scalar    | Int        |   100 | 11.610 ns | 0.1033 ns | 2.35x faster |
| Baseline_Int    | Vector128 | Int        |   100 | 26.946 ns | 0.3698 ns | 1.01x faster |
| LINQ_Int        | Vector128 | Int        |   100 |  8.458 ns | 0.0746 ns | 3.22x faster |
| System_Int      | Vector128 | Int        |   100 |  5.594 ns | 0.0676 ns | 4.87x faster |
| NetFabric_Int   | Vector128 | Int        |   100 |  6.485 ns | 0.1132 ns | 4.20x faster |
| Baseline_Int    | Vector256 | Int        |   100 | 26.840 ns | 0.3518 ns | 1.01x faster |
| LINQ_Int        | Vector256 | Int        |   100 |  5.974 ns | 0.0593 ns | 4.56x faster |
| System_Int      | Vector256 | Int        |   100 |  4.477 ns | 0.0561 ns | 6.09x faster |
| NetFabric_Int   | Vector256 | Int        |   100 |  5.802 ns | 0.0658 ns | 4.69x faster |
| Baseline_Int    | Vector512 | Int        |   100 | 27.199 ns | 0.1354 ns | 1.00x faster |
| LINQ_Int        | Vector512 | Int        |   100 |  5.869 ns | 0.0614 ns | 4.64x faster |
| System_Int      | Vector512 | Int        |   100 |  4.434 ns | 0.0766 ns | 6.14x faster |
| NetFabric_Int   | Vector512 | Int        |   100 |  5.564 ns | 0.0544 ns | 4.90x faster |