A different kind of C#: going wide

Writing efficient SIMD-accelerated code can be tricky. Consider a single cross product:

result.X = a.Y * b.Z - a.Z * b.Y;
result.Y = a.Z * b.X - a.X * b.Z;
result.Z = a.X * b.Y - a.Y * b.X;

If you'd like to try a brain teaser, imagine you’ve got subtract, multiply, and lane shuffle operations that can handle 4 lanes of data at the same time. Try to come up with the minimal number of operations needed to express the same mathematical result for a single pair of input 3d vectors.

Then, compare it to the scalar version. Without spoiling much, it's not going to be 4x faster with 4-wide operations.

Problems like this show up any time you try to narrowly vectorize scalar-ish code. You rarely get the benefit of 128 bit operations, let alone newer 256 bit or 512 bit operations. There just aren't enough independent lanes of computation to saturate the machine with real work.

Fortunately, most compute intensive programs involve doing one thing many times to a bunch of instances. If you build your application from the ground up assuming this batch processing model, it doesn’t even involve more difficulty than traditional approaches (though it does have different difficulties).

Let’s assume we’ve got a nice big batch of 2^20 operations to perform, each computing:

result = dot(dot(cross(a, b), a) * b, dot(cross(c, d), c) * d)

Array of Structures

We’ll start with a naive scalar implementation and a narrowly vectorized version as a basis for comparison. They could look something like this:

public struct Vector3AOS
  public float X;
  public float Y;
  public float Z;
for (int i = 0; i < LaneCount; ++i)
  ref var lane = ref input[i];
  Vector3AOS.CrossScalar(lane.A, lane.B, out var axb);
  Vector3AOS.CrossScalar(lane.C, lane.D, out var cxd);
  var axbDotA = Vector3AOS.DotScalar(axb, lane.A);
  var cxdDotC = Vector3AOS.DotScalar(cxd, lane.C);
  Vector3AOS.ScaleScalar(lane.B, axbDotA, out var left);
  Vector3AOS.ScaleScalar(lane.D, cxdDotC, out var right);
  results[i] = Vector3AOS.DotScalar(left, right);
for (int i = 0; i < LaneCount; ++i)
  ref var lane = ref input[i];
  Vector3AOS.Cross3ShuffleSSE(lane.A, lane.B, out var axb);
  Vector3AOS.Cross3ShuffleSSE(lane.C, lane.D, out var cxd);
  Vector3AOS.DotSSE(axb, lane.A, out var axbDotA);
  Vector3AOS.DotSSE(cxd, lane.C, out var cxdDotC);
  Vector3AOS.ScaleSSE(lane.B, axbDotA, out var left);
  Vector3AOS.ScaleSSE(lane.D, cxdDotC, out var right);
  Vector3AOS.DotSSE(left, right, out var result);
  results[i] = Sse41.Extract(result, 0);

(The full implementation for all of these approaches can be found in the GoingWide project in the scratchpad repo.)

They’re both very similar at the level of memory accesses. Load the relevant whole struct instance from the array (hence array of structures!), do some work with it, store it out. At a lower level, the SSE version is a bit weirder. For example, the Vector3AOS.Cross3ShuffleSSE function is not quite as intuitive as the earlier cross product code:

const byte yzxControl = (3 << 6) | (0 << 4) | (2 << 2) | 1;
var shuffleA = Sse.Shuffle(a, a, yzxControl);
var shuffleB = Sse.Shuffle(b, b, yzxControl);
result = Sse.Subtract(Sse.Multiply(a, shuffleB), Sse.Multiply(b, shuffleA));
result = Sse.Shuffle(result, result, yzxControl);

Only three of those intrinsics calls are actually doing arithmetic.

Despite its ugliness, the SSE variant is faster even with the current alpha state of intrinsics support by about 25-30%. Still a far cry from 4x faster.

To do better than this, we need to make use of the fact that we’re doing the same thing a million times in a row.

Structure of Arrays

Array of structures (AOS) lays out memory instance by instance. Each struct contains all the data associated with one instance. That’s pretty intuitive and convenient, but as we’ve seen, it’s not a natural representation for wide vectorization. At best, you’d have to convert the AOS memory layout into a vectorized form at runtime, perhaps walking through multiple instances and creating vectorized bundles of instance properties. But that takes valuable time that we could be spending on actual computation!

What if we just stored every instance property in its own big array?

Buffer<float> ax, ay, az, bx, by, bz, cx, cy, cz, dx, dy, dz, result;

public override void Execute()
  for (int i = 0; i < LaneCount; i += Vector<float>.Count)
    ref var ax = ref Unsafe.As<float, Vector<float>>(ref this.ax[i]);
    ref var ay = ref Unsafe.As<float, Vector<float>>(ref this.ay[i]);
    ref var az = ref Unsafe.As<float, Vector<float>>(ref this.az[i]);
    ref var bx = ref Unsafe.As<float, Vector<float>>(ref this.bx[i]);
    ref var by = ref Unsafe.As<float, Vector<float>>(ref this.by[i]);
    ref var bz = ref Unsafe.As<float, Vector<float>>(ref this.bz[i]);
    ref var cx = ref Unsafe.As<float, Vector<float>>(ref this.cx[i]);
    ref var cy = ref Unsafe.As<float, Vector<float>>(ref this.cy[i]);
    ref var cz = ref Unsafe.As<float, Vector<float>>(ref this.cz[i]);
    ref var dx = ref Unsafe.As<float, Vector<float>>(ref this.dx[i]);
    ref var dy = ref Unsafe.As<float, Vector<float>>(ref this.dy[i]);
    ref var dz = ref Unsafe.As<float, Vector<float>>(ref this.dz[i]);

    Cross(ax, ay, az, bx, by, bz, out var axbx, out var axby, out var axbz);
    Cross(cx, cy, cz, dx, dy, dz, out var cxdx, out var cxdy, out var cxdz);
    Dot(axbx, axby, axbz, ax, ay, az, out var axbDotA);
    Dot(cxdx, cxdy, cxdz, cx, cy, cz, out var cxdDotC);
    Scale(bx, by, bz, axbDotA, out var leftx, out var lefty, out var leftz);
    Scale(bx, by, bz, cxdDotC, out var rightx, out var righty, out var rightz);
    Dot(leftx, lefty, leftz, rightx, righty, rightz, out Unsafe.As<float, Vector<float>>(ref result[i]));

(Note that the above implementation is uglier than it needs to be- the goal was to keep it conceptually separated from the next approach, but a real implementation could be less verbose.)

This turns out to be about 2.1x faster than the original scalar version. That’s an improvement, even if not quite as large as we’d like. Memory bandwidth is starting to block some cycles but it’s not the only issue.

One potential concern when using SOA layouts is an excessive number of prefetcher address streams. Keeping track of a bunch of different memory locations can hit bottlenecks in several parts of the memory system even if each access stream is fully sequential. In the above, the 13 active address streams are enough to be noticeable; if expressed in the same kind of SOA layout, the current Hinge constraint would have 47 streams for the projection data alone.

You could also try changing the loop to instead perform each individual operation across all lanes before moving on to the next one (i.e. multiply all ay * bz, store it in a temporary, then do all az * by and store it, and so on). That would indeed eliminate address streams as a concern given that there would only ever be two or three in flight at a time. Unfortunately, every single operation would evict the L1, L2, and with a million instances on many consumer processors, even all of L3. No data reuse is possible and the end result is a massive memory bandwidth bottleneck and about 3.5x worse performance than the naive scalar version. Now consider what it might look like if the loop were modified to process a cache friendly block of properties at a time…

Tying it together: Array Of Structures Of Arrays

We want something that:

  1. allows efficient vectorized execution,

  2. is usable without SPMD/SIMT tooling (see later section),

  3. feels fairly natural to use (preserving scalar-ish logic),

  4. is cache friendly without requiring macro-level tiling logic, and

  5. doesn't abuse the prefetcher.

There is an option that meets all these requirements: AOSOA, Array Of Structures Of Arrays. Take the bundled Structure Of Arrays layout from earlier, restrict the array lengths to be the size of a single SIMD bundle, shove all of those smaller bundle-arrays into a struct, and store all those bundle-array containing structs in an array. Or, to put it in a less grammatically torturous way:

public unsafe struct Vector3AOSOANumerics
  public Vector<float> X;
  public Vector<float> Y;
  public Vector<float> Z;

struct Input
  public Vector3AOSOANumerics A;
  public Vector3AOSOANumerics B;
  public Vector3AOSOANumerics C;
  public Vector3AOSOANumerics D;

Buffer<Input> input;
Buffer<Vector<float>> results;

public override void Execute()
  for (int i = 0; i < LaneCount / Vector<float>.Count; ++i)
    ref var lane = ref input[i];
    Vector3AOSOANumerics.Cross(lane.A, lane.B, out var axb);
    Vector3AOSOANumerics.Cross(lane.C, lane.D, out var cxd);
    Vector3AOSOANumerics.Dot(axb, lane.A, out var axbDotA);
    Vector3AOSOANumerics.Dot(cxd, lane.C, out var cxdDotC);
    Vector3AOSOANumerics.Scale(lane.B, axbDotA, out var left);
    Vector3AOSOANumerics.Scale(lane.D, cxdDotC, out var right);
    Vector3AOSOANumerics.Dot(left, right, out results[i]);

If you squint a little bit and ignore the names, it looks reasonably close to a scalar version. Not as pretty as if we could use operators without worrying about codegen quality, but still not bad.

Performance wise, this runs about 2.4 times as fast as the scalar version. Not 4x, but as you might guess by now that is partially caused by a memory bandwidth bottleneck. The benchmark simply doesn’t have enough math per loaded byte. (If you change the code to work in L1 only, it bumps up to 3.24x faster than the scalar version. Still not perfect scaling for the 4 wide operations used on my machine, but closer- and the gap can close with additional compiler improvements.)

AOSOA layouts are used in almost every computationally expensive part of the engine. All constraints and almost all convex collision detection routines use it. In most simulations, these widely vectorized codepaths make up the majority of execution time. That’s a big part of why v2 is so speedy.

On autovectorization and friends

“Sufficiently smart compilers” could, in theory, take the first naive scalar implementation and transform it directly into a fully vectorized form. They could even handle the annoying case where your data doesn’t line up with the SIMD width and a fallback implementation for the remainder would be better (which I’ve conveniently ignored in this post). With no programmer effort whatsoever, you could take advantage of a machine’s full capacity! How wonderful!

Unfortunately, the CoreCLR JIT does not support autovectorization at the moment, and even offline optimizing compilers struggle. C, C++, C# and all their friends are extremely permissive languages that simply don’t provide enough guarantees for the compiler to safely transform code in the extreme ways that vectorization sometimes demands.

In practice, the promising magic of autovectorization feels more like coaxing a dog across a tile floor. “Trust me, it’s okay girl, c’mere,” you say, “I know it’s sometimes slippery, but I promise you’re really going to be okay, I made sure you won’t hit anything, look, I’ve got your favorite tasty PragmaSimd treat”- and then she takes a step forward! And then she yelps and steps back onto the carpet and you go back to reading your dog’s optimization manual.

But autovectorization and intrinsics aren’t the only way to handle things. By restricting language freedom a little bit to give the compiler additional guarantees, you can write simple scalar-looking code that gets transformed into (mostly) efficient vectorized code. The most common examples of this are GPU shading languages like HLSL. Each ‘thread’ of execution on modern GPUs behaves a lot like a single lane of CPU vector instructions (though GPUs tend to use wider execution units and different latency hiding strategies, among other nuances).

This approach is usually called SPMD (Single Program Multiple Data) or SIMT (Single Instruction Multiple Thread) and it can achieve performance very close to hand written intrinsics with far less work. It’s less common in CPU land, but there are still options like ISPC and OpenCL. (Unity’s Burst may qualify, but it’s still in development and I’m unclear on where exactly it sits on the spectrum of autovectorizer to full-blown SPMD transform.)

Further, even in HLSL, you are still responsible for an efficient memory layout. Randomly accessing a bunch of properties scattered all over the place will obliterate performance even if all the code is optimally vectorized.

So there’s no fully general magic solution for all C# projects at the moment. If we want to maximize performance, we have to do it the hard(er) way.

One size doesn’t fit all

Clearly, not everything needs to be vectorized, and not everything needs to use an AOSOA layout. Even when performance is important, AOSOA is not automatically the right choice. Always defer to memory access patterns.

One example of this in bepuphysics v2 is body data. All of it is stored in regular old AOS format even though the PoseIntegrator does some nontrivial math with it. Why? Because the PoseIntegrator isn’t a large chunk of frame time, it’s often already memory bandwidth bound, and body data is frequently requested by collision detection and constraint solving routines. Storing all of a body’s velocities in a single cache line minimizes the number of cache lines that the solver has to touch.

While bepuphysics v2 doesn’t use pure SOA layouts, they can be valuable. If you don’t have excessive numbers of properties and there are many accesses to individual properties without their neighbors, it’s a great fit. Indexing with SOA is a little easier than in AOSOA too.

Addendum: performance

Vectorization Style Performance.png

This chart shows all the different approaches and their performance on my 3770K, plus some variants that I didn’t discuss earlier in this post.

  • AOS Numerics uses the System.Numerics.Vector3 type but ends up being fractionally slower than the fully scalar path. That’s not surprising- dot products and cross products don’t have a fast path. In fact, Vector3’s cross product is implemented in a completely scalar way.

  • SOA Doofy is the ‘compute a single operation at a time and evict entire cache’ implementation noted in the SOA section. Not ideal.

  • AOSOA Intrinsics Unsafe and AOSOA Intrinsics Load/Store are a couple of investigations into codegen on the new platform intrinsics types under development. They’re actually faster when AVX is disabled in favor of SSE alone, but even then they’re still slower than the System.Numerics.Vector AOSOA implementation. Both suffer significantly from L1/store bottlenecks caused by aggressively shoving things into memory rather than just holding them in registers. (Not too surprising in the Load/Store variant where I explicitly told it to do that.)

I wouldn’t put too much stock in the intrinsics numbers at this point- I ran this on a daily build of the alpha, after all. There’s probably a way to tease out some better codegen than represented above, and if not, the situation will likely improve with further work in the compiler.