While compilers can sometimes auto-vectorize C++ code, all mainstream compilers fail to do so on this simple function. There are several approaches to vectorize C++ code.
C SIMD intrinsics
The most famous approach is C SIMD intrinsics, which are great for performance but not portable; they also make your code really ugly and harder to maintain/debug. On x86-64 CPUs, you can write code targeting SSE4.2, which is supported on nearly all modern CPUs, but it will not be very efficient. A lot of x86-64 CPUs support AVX2, so you can write code for that which will be pretty efficient, but it will simply crash on rather old CPUs not supporting it (~6% based on the Steam survey). For very good performance on large arrays, AVX-512 is the way to go, but it is supported only on a fraction of client-side CPUs (12% based on the Steam survey). Most recent server-side CPUs support it though. On ARM, Neon is generally the way to go, though some recent CPUs support SVE. This SIMD instruction-set fragmentation is a pain when writing C/C++ code using C SIMD intrinsics, so I do not advise using them unless you target only one specific platform.
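For illustration, here is a rough, untested sketch of what an AVX2-only version of your function could look like (the function name and the scalar tail are mine); it shows both why intrinsics are fast and why the code gets tied to one instruction set:

#include <immintrin.h>
#include <cstdint>

// Hypothetical AVX2-only version (untested sketch, not the original code).
int check_adjacent_avx2(const uint8_t *arr, int size, uint8_t a, uint8_t b)
{
    const __m256i va = _mm256_set1_epi8((char)a);
    const __m256i vb = _mm256_set1_epi8((char)b);
    int i = 0;

    // Compare 32 adjacent pairs per iteration: arr[i..i+31] against arr[i+1..i+32].
    for (; i + 32 < size; i += 32)
    {
        const __m256i cur = _mm256_loadu_si256((const __m256i*)(arr + i));
        const __m256i nxt = _mm256_loadu_si256((const __m256i*)(arr + i + 1));
        const __m256i ab = _mm256_and_si256(_mm256_cmpeq_epi8(cur, va), _mm256_cmpeq_epi8(nxt, vb));
        const __m256i ba = _mm256_and_si256(_mm256_cmpeq_epi8(cur, vb), _mm256_cmpeq_epi8(nxt, va));
        if (_mm256_movemask_epi8(_mm256_or_si256(ab, ba)) != 0)
            return 1;
    }

    // Scalar tail for the last few items.
    for (; i < size - 1; ++i)
        if ((arr[i] == a && arr[i + 1] == b) || (arr[i] == b && arr[i + 1] == a))
            return 1;

    return 0;
}

An SSE or Neon version needs different types and intrinsics, which is exactly the maintenance problem described above.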
C++ SIMD libraries
A frequent alternative is to use a C++ SIMD library, since the experimental SIMD technical specification is not merged into the C++ standard yet. Such libraries include XSIMD, Google's Highway and VC. A SIMD library may not support all the features provided by your target CPU (masking/predication is often a pain with a lot of C++ SIMD libraries), nor even many CPU architectures. Its performance depends on many parameters, like the target code, the target CPU architecture and the specific target SIMD instruction set. In general, you have to compile the code for one specific SIMD instruction set, which means either not supporting old CPUs or possibly being sub-optimal on recent ones (unless you manually recompile the code for each possible SIMD instruction set and write code to pick the best implementation at runtime, which is tedious IMHO, especially for just one simple function).
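To give an idea of what this looks like, here is a rough, untested sketch based on the experimental SIMD TS mentioned above (GCC's libstdc++ ships it in <experimental/simd>); a library like XSIMD or Highway would look broadly similar, though the exact names differ:

#include <experimental/simd>
#include <cstdint>

namespace stdx = std::experimental;

// Untested sketch using the experimental SIMD TS (the function name is mine).
int check_adjacent_simd(const uint8_t *arr, int size, uint8_t a, uint8_t b)
{
    using vec = stdx::native_simd<uint8_t>;
    const vec va = a;                        // broadcast a to all lanes
    const vec vb = b;
    const int step = (int)vec::size();

    int i = 0;
    for (; i + step < size; i += step)
    {
        vec cur, nxt;
        cur.copy_from(arr + i, stdx::element_aligned);
        nxt.copy_from(arr + i + 1, stdx::element_aligned);
        const auto hit = (cur == va && nxt == vb) || (cur == vb && nxt == va);
        if (stdx::any_of(hit))
            return 1;
    }

    // Scalar tail for the last few items.
    for (; i < size - 1; ++i)
        if ((arr[i] == a && arr[i + 1] == b) || (arr[i] == b && arr[i + 1] == a))
            return 1;

    return 0;
}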
OpenMP
Another alternative is to use OpenMP. SIMD directives are currently pretty limited and there is no guarantee that the code will actually be vectorized by the target compiler, so you should check the assembly code to be sure. This solution is certainly the simplest, though the generated code is generally sub-optimal; it is a good way to easily get faster, portable code. In practice, OpenMP does not support a return statement in vectorized loops, so the code must be adapted to operate on chunks. Moreover, lazy boolean operators tend not to be vectorized well by mainstream compilers. The three mainstream compilers seem to generate SIMD instructions in this case (though the MSVC warning is worrying) with the right compiler flags (e.g. -fopenmp-simd for GCC/Clang, /openmp:experimental for MSVC), as long as the reduced variable is not a boolean (see on Godbolt). Here is the code:
#include <cstdint>
#include <algorithm>

int check_adjacent_impl(uint8_t *arr, int size, int a, int b)
{
    constexpr int chunkSize = 256;

    // Operate on fixed-size chunks since a `return` is not allowed in a SIMD loop.
    for (int i = 0; i < size - 1; i += chunkSize)
    {
        int res = 0;

        #pragma omp simd reduction(|:res)
        for (int j = i; j < std::min(i + chunkSize, size - 1); ++j)
            res |= ((arr[j] == a) & (arr[j + 1] == b)) | ((arr[j] == b) & (arr[j + 1] == a));

        if (res)
            return 1;
    }

    return 0;
}
The generated code is clearly inefficient. However, using 8-bit variables (a, b and res) results in much better assembly code (see on Godbolt); the adapted source is shown after the assembly below. GCC still generates sub-optimal code, but the ones from Clang and MSVC are pretty good IMHO, especially Clang's:
.L5:
vmovdqu ymm1, YMMWORD PTR [r13+0+rax]
vmovdqu ymm2, YMMWORD PTR [rsi+rax]
add rax, 32
vpcmpeqb ymm0, ymm1, ymm4
vpcmpeqb ymm8, ymm2, ymm3
vpcmpeqb ymm1, ymm1, ymm3
vpcmpeqb ymm2, ymm4, ymm2
vpand ymm0, ymm0, ymm8
vpand ymm1, ymm1, ymm2
vpor ymm0, ymm0, ymm1
vpand ymm0, ymm0, ymm5
vpor ymm7, ymm7, ymm0
cmp rdx, rax
jne .L5
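For reference, here is what the 8-bit variant mentioned above could look like; it is the same chunked loop, only the types of a, b and res change (untested sketch, the function name is mine to avoid a clash):

#include <cstdint>
#include <algorithm>

// Same chunked OpenMP loop, but with 8-bit a, b and res so the
// comparisons and the reduction stay in byte lanes (untested sketch).
int check_adjacent_impl_u8(uint8_t *arr, int size, uint8_t a, uint8_t b)
{
    constexpr int chunkSize = 256;

    for (int i = 0; i < size - 1; i += chunkSize)
    {
        uint8_t res = 0;

        #pragma omp simd reduction(|:res)
        for (int j = i; j < std::min(i + chunkSize, size - 1); ++j)
            res |= ((arr[j] == a) & (arr[j + 1] == b)) | ((arr[j] == b) & (arr[j + 1] == a));

        if (res)
            return 1;
    }

    return 0;
}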
ISPC
Yet another solution is to use a SIMD-friendly language like ISPC or OpenCL (the latter is better suited for accelerators like GPUs though). ISPC supports x86-like CPUs and ARM ones. It is mainly optimized for Intel CPUs and for 32-bit types. It is great for easily vectorizing non-trivial code and getting pretty good performance (often sub-optimal though). Predication, swizzles, gathers and reduction operators are natively supported (and rather trivial to use). The code is portable and easier to maintain than with C SIMD intrinsics. Explicit vectorization guarantees that the code is vectorized, as opposed to auto-vectorization or OpenMP directives. It also helps to optimize some code better than most compilers can (especially for skilled developers). The code tends to be more readable than with a C++ SIMD library, though ISPC is a different language that does not support all C++ features, so the integration with C++ code can sometimes be complicated. Note that ISPC is not really well suited for SIMD code requiring a fixed vector size (e.g. only 4 items), and AFAIK some very specialized platform-specific intrinsics are not available (e.g. _mm_mpsadbw_epu8). The big advantage of ISPC is that you can tell the compiler to generate code for SSE2, SSE4.2, AVX2 and AVX-512 and choose the best implementation at runtime based on the target CPU with no development effort. Here is an (untested) example for your function (see on Godbolt):
static inline unmasked uniform bool checkBlock(uniform uint8 arr[], uniform int i, uniform uint8 a, uniform uint8 b)
{
    const int j = i + programIndex;
    const uint8 v1 = arr[j];
    const uint8 v2 = arr[j + 1];
    return any(((v1 == a) & (v2 == b)) | ((v1 == b) & (v2 == a)));
}

export uniform int check_adjacent_impl(uniform uint8 arr[], uniform int size, uniform uint8 a, uniform uint8 b)
{
    const uniform int vecSize = (size - 1) / programCount * programCount;

    // Main loop
    for (uniform int i = 0; i < vecSize; i += programCount)
        if (checkBlock(arr, i, a, b))
            return 1;

    // Last remaining items
    if (vecSize + programIndex < size - 1 && checkBlock(arr, vecSize, a, b))
        return 1;

    return 0;
}
On x86-64 CPUs, you can compile it with ispc code.ispc --target=sse2-i32x4,sse4.1-i8x16,avx2-i8x32,avx512skx-x64 (you may need --dllexport on Windows, or --pic on Linux, depending on your needs) and then link the generated object files. It will run on all x86-64 CPUs and be pretty efficient. On ARM, you just need to write --target=neon-i8x16 instead and that's all (SVE is not currently supported). For example, here is the generated AVX2 hot loop:
.LBB1_3:
vmovdqu ymm2, ymmword ptr [rdi + r9]
vmovdqu ymm3, ymmword ptr [rdi + r9 + 1]
vpcmpeqb ymm4, ymm2, ymm0
vpcmpeqb ymm5, ymm3, ymm1
vpand ymm4, ymm5, ymm4
vpcmpeqb ymm2, ymm2, ymm1
vpcmpeqb ymm3, ymm3, ymm0
vpand ymm2, ymm3, ymm2
vpor ymm2, ymm4, ymm2
vpmovmskb r10d, ymm2
test r10d, r10d
jne .LBB1_4
add r9, 32
cmp r9, rsi
jl .LBB1_3
This seems very good IMHO (far better than your scalar code, which results in assembly full of conditional branches). It is similar to the OpenMP code generated by Clang, but the ISPC code can stop earlier (because it does not operate on chunks). This code should be dozens of times faster than your C++ scalar code on relatively small arrays (i.e. arrays fitting in your CPU's L1/L2 caches). For large arrays, it should be memory-bound.
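Finally, calling the exported function from C++ is simple. Here is a sketch, assuming the object files from the multi-target build above are linked in (you can also use the header ispc can generate instead of writing the declaration by hand; uniform uint8 and uniform int map to uint8_t and int):

#include <cstdint>
#include <cstdio>

// Hand-written declaration of the ISPC-exported function (a sketch; ispc can
// generate a header containing this declaration for you).
extern "C" int check_adjacent_impl(uint8_t arr[], int size, uint8_t a, uint8_t b);

int main()
{
    uint8_t data[] = {1, 5, 3, 7, 3, 2};
    // Prints 1: the adjacent pair (3, 7) matches (b, a).
    printf("%d\n", check_adjacent_impl(data, 6, 7, 3));
    return 0;
}

With the multi-target build, the check_adjacent_impl symbol dispatches to the best available implementation (SSE2/SSE4.1/AVX2/AVX-512) at runtime, so the caller does not have to care about the target CPU.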