A different kind of C#: watching your step

In the previous post, we saw that the JIT does a reasonably good job generating efficient code.

Does that mean we can simply trust the compiler to do the right thing all the time? Let's find out!

We'll investigate using a very simple microbenchmark: add three different vector representations together in three different ways each. Here are the competitors:

struct VFloat
{
  public float X;
  public float Y;
  public float Z;
}
struct VNumerics3
{
  public Vector<float> X;
  public Vector<float> Y;
  public Vector<float> Z;
}
struct VAvx
{
  public Vector256<float> X;
  public Vector256<float> Y;
  public Vector256<float> Z;
}

The leftmost type, VFloat, is the simplest representation for three scalars. It’s not a particularly fair comparison since the Vector{T} type contains 4 scalars on the tested 3770K and the Vector256{float} contains 8, so they’re conceptually doing way more work. Despite that, comparing them will reveal some interesting compiler and processor properties.

The three Add implementations tested will be a manually inlined version, a static function with in/out parameters, and an operator. Here’s how the function and operator look for VFloat; I’ll omit the manually inlined implementation and other types for brevity (but you can see them on github):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Add(in VFloat a, in VFloat b, out VFloat result)
{
  result.X = a.X + b.X;
  result.Y = a.Y + b.Y;
  result.Z = a.Z + b.Z;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VFloat operator +(in VFloat a, in VFloat b)
{
  VFloat result;
  result.X = a.X + b.X;
  result.Y = a.Y + b.Y;
  result.Z = a.Z + b.Z;
  return result;
}

Each addition will be called several times in a loop. Some adds are independent, some are dependent. The result of each iteration gets stored into an accumulator to keep the loop from being optimized into nonexistence. Something like this:

for (int j = 0; j < innerIterationCount; ++j)
{
  ref var value = ref Unsafe.Add(ref baseValue, j);
  var r0 = value + value;
  var r1 = value + value;
  var r2 = value + value;
  var r3 = value + value;
  var i0 = r0 + r1;
  var i1 = r2 + r3;
  var i2 = i0 + i1;
  accumulator += i2;
}

Historically, using operators for value types implied a great deal of copying for both the parameters and returned value even when the function was inlined. (Allowing ‘in’ on operator parameters helps with this a little bit, at least in cases where the JIT isn’t able to eliminate the copies without assistance.)

To compensate, many C# libraries with any degree of performance sensitivity like XNA and its progeny offered ref/out overloads. That helped, but not being able to use operators efficiently always hurt readability. Having refs spewed all over your code wasn’t too great either, but in parameters (which require no call site decoration) have saved us from that in most cases.

But for maximum performance, you had to bite the bullet and manually inline. It was a recipe for an unmaintainable mess, but it was marginally faster!

Focusing on VFloat for a moment, how does that situation look today? Testing on the .NET Core 3.0.0-preview1-26829-01 alpha runtime:

VFloat Add Benchmarks.png

All effectively the same! The resulting assembly:

Manually Inlined: 
vmovss      xmm3,dword ptr [r8]  
vaddss      xmm3,xmm3,xmm3  
vmovss      xmm4,dword ptr [r8+4]  
vaddss      xmm4,xmm4,xmm4  
vmovaps     xmm5,xmm4  
vmovss      xmm6,dword ptr [r8+8]  
vaddss      xmm6,xmm6,xmm6  
vmovaps     xmm7,xmm6  
vmovaps     xmm8,xmm3  
vaddss      xmm4,xmm4,xmm5  
vmovaps     xmm5,xmm4  
vaddss      xmm6,xmm6,xmm7  
vmovaps     xmm7,xmm6  
vaddss      xmm3,xmm3,xmm8  
vmovaps     xmm8,xmm3  
vaddss      xmm4,xmm4,xmm5  
vmovaps     xmm5,xmm7  
vaddss      xmm5,xmm5,xmm6  
vaddss      xmm3,xmm3,xmm8  
vaddss      xmm0,xmm0,xmm3  
vaddss      xmm1,xmm1,xmm4  
vaddss      xmm2,xmm2,xmm5  
Operator:
vmovss      xmm3,dword ptr [r8]  
vaddss      xmm3,xmm3,xmm3  
vmovaps     xmm4,xmm3  
vmovss      xmm5,dword ptr [r8+4]  
vaddss      xmm5,xmm5,xmm5  
vmovaps     xmm6,xmm5  
vmovss      xmm7,dword ptr [r8+8]  
vaddss      xmm7,xmm7,xmm7  
vmovaps     xmm8,xmm7  
vaddss      xmm3,xmm3,xmm4  
vmovaps     xmm4,xmm3  
vaddss      xmm5,xmm5,xmm6  
vmovaps     xmm6,xmm5  
vaddss      xmm7,xmm7,xmm8  
vmovaps     xmm8,xmm7  
vaddss      xmm3,xmm3,xmm4  
vmovaps     xmm4,xmm6  
vaddss      xmm4,xmm4,xmm5  
vmovaps     xmm5,xmm8  
vaddss      xmm5,xmm5,xmm7  
vaddss      xmm0,xmm0,xmm3  
vaddss      xmm1,xmm1,xmm4  
vaddss      xmm2,xmm2,xmm5  

The manually inlined version and the operator version differ by a single instruction. That’s good news- using operators is, at least in some cases, totally fine now! Also, note that there are only 12 vaddss instructions, cutting out the other 12 redundant adds. Some cleverness!

Now let’s see how things look across all the test cases…

3.0.0-preview1-26829-01 Add Benchmarks.png

Oh, dear. The preview nature of this runtime has suddenly become relevant. Using an operator for the VAvx type is catastrophic. Comparing the manually inlined version to the operator version:

Manually Inlined:
vmovups     ymm3,ymmword ptr [r9]  
cmp         dword ptr [r8],r8d  
lea         r9,[r8+20h]  
vmovups     ymm4,ymmword ptr [r9]  
cmp         dword ptr [r8],r8d  
add         r8,40h  
vaddps      ymm5,ymm3,ymm3  
vaddps      ymm6,ymm4,ymm4  
vmovups     ymm7,ymmword ptr [r8]  
vaddps      ymm8,ymm7,ymm7  
vaddps      ymm9,ymm3,ymm3  
vaddps      ymm10,ymm4,ymm4  
vaddps      ymm11,ymm7,ymm7  
vaddps      ymm12,ymm3,ymm3  
vaddps      ymm13,ymm4,ymm4  
vaddps      ymm14,ymm7,ymm7  
vaddps      ymm3,ymm3,ymm3  
vaddps      ymm4,ymm4,ymm4  
vaddps      ymm7,ymm7,ymm7  
vaddps      ymm6,ymm6,ymm10  
vaddps      ymm8,ymm8,ymm11  
vaddps      ymm3,ymm12,ymm3  
vaddps      ymm4,ymm13,ymm4  
vaddps      ymm7,ymm14,ymm7  
vaddps      ymm4,ymm6,ymm4  
vaddps      ymm6,ymm8,ymm7  
vaddps      ymm5,ymm5,ymm9  
vaddps      ymm3,ymm5,ymm3  
vaddps      ymm0,ymm3,ymm0  
vaddps      ymm1,ymm4,ymm1  
vaddps      ymm2,ymm6,ymm2  
Operator:
lea         rdx,[rsp+2A0h]  
vxorps      xmm0,xmm0,xmm0  
vmovdqu     xmmword ptr [rdx],xmm0  
vmovdqu     xmmword ptr [rdx+10h],xmm0  
vmovdqu     xmmword ptr [rdx+20h],xmm0  
vmovdqu     xmmword ptr [rdx+30h],xmm0  
vmovdqu     xmmword ptr [rdx+40h],xmm0  
vmovdqu     xmmword ptr [rdx+50h],xmm0  
vmovupd     ymm0,ymmword ptr [rbp]  
vaddps      ymm0,ymm0,ymmword ptr [rbp]  
lea         rdx,[rsp+2A0h]  
vmovupd     ymmword ptr [rdx],ymm0  
vmovupd     ymm0,ymmword ptr [rbp+20h]  
vaddps      ymm0,ymm0,ymmword ptr [rbp+20h]  
lea         rdx,[rsp+2C0h]  
vmovupd     ymmword ptr [rdx],ymm0  
vmovupd     ymm0,ymmword ptr [rbp+40h]  
vaddps      ymm0,ymm0,ymmword ptr [rbp+40h]  
lea         rdx,[rsp+2E0h]  
vmovupd     ymmword ptr [rdx],ymm0  
lea         rcx,[rsp+540h]  
lea         rdx,[rsp+2A0h]  
lea         rdx,[rsp+2A0h]  
mov         r8d,60h  
call        00007FFC1961C290  
  ... repeat another 7 times...

The manually inlined variant does pretty well, producing a tight sequence of 24 vaddps instructions operating on ymm registers. Without optimizing away the redundant adds, that’s about as good as you’re going to get.

The operator version is… less good. Clearing a bunch of memory, unnecessary loads and stores, capped off with a curious function call. Not surprising that it’s 50 times slower.

Clearly something wonky is going on there, but let’s move on for now. Zooming in a bit so we can see the other results:

3.0.0-preview1-26829-01 Add Benchmarks, clamped.png

Both Vector{T} and AVX are slower than VFloat when manually inlined, but that’s expected given that half the adds got optimized away. Unfortunately, it looks like even non-operator functions take a hit relative to the manually inlined implementation.

When manually inlined, 8-wide AVX is also a little faster than 4-wide Vector{T}. On a 3770K, the relevant 4 wide and 8 wide instructions have the same throughput, so being pretty close is expected. The marginal slowdown arises from the Vector{T} implementation using extra vmovupd instructions to load input values. Manually caching the values in a local variable actually helps some.

Focusing on the function and operator slowdown, here’s the assembly generated for the Vector{T} function and operator cases:

Vector<T> add function:
vmovupd     xmm0,xmmword ptr [r8]  
vmovupd     xmm1,xmmword ptr [r8]  
vaddps      xmm0,xmm0,xmm1  
vmovapd     xmmword ptr [rsp+110h],xmm0  
vmovupd     xmm0,xmmword ptr [r8+10h]  
vmovupd     xmm1,xmmword ptr [r8+10h]  
vaddps      xmm0,xmm0,xmm1  
vmovapd     xmmword ptr [rsp+100h],xmm0  
vmovupd     xmm0,xmmword ptr [r8+20h]  
vmovupd     xmm1,xmmword ptr [r8+20h]  
vaddps      xmm0,xmm0,xmm1  
vmovapd     xmmword ptr [rsp+0F0h],xmm0  
... repeat...
Vector<T> operator:
vmovapd     xmm3,xmmword ptr [rsp+170h]  
vmovapd     xmm4,xmmword ptr [rsp+160h]  
vmovapd     xmm5,xmmword ptr [rsp+150h]  
vmovupd     xmm6,xmmword ptr [r8]  
vmovupd     xmm7,xmmword ptr [r8]  
vaddps      xmm6,xmm6,xmm7  
vmovapd     xmmword ptr [rsp+140h],xmm6  
vmovupd     xmm6,xmmword ptr [r8+10h]  
vmovupd     xmm7,xmmword ptr [r8+10h]  
vaddps      xmm6,xmm6,xmm7  
vmovapd     xmmword ptr [rsp+130h],xmm6  
vmovupd     xmm6,xmmword ptr [r8+20h]  
vmovupd     xmm7,xmmword ptr [r8+20h]  
vaddps      xmm6,xmm6,xmm7  
vmovapd     xmmword ptr [rsp+120h],xmm6  
... repeat...

Nothing crazy happening, but there’s clearly a lot of register juggling that the earlier manually inlined AVX version didn’t do. The add function versus manual inlining difference is more pronounced in the AVX case, but the cause is similar (with some more lea instructions).

But this is an early preview version. What happens if we update to a daily build from a few weeks after the one tested above?

Add Benchmarks, clamped.png

A little better on function AVX, and more than 17 times faster on operator AVX. Not ideal, perhaps, but much closer to reasonable.

(If you’re wondering why the AVX path seems to handle things differently than the Vector{T} paths, Vector{T} came first and has its own set of JIT intrinsic implementations. The two may become unified in the future, on top of some additional work to avoid quality regressions.)

Microbenchmarks are one thing; how do these kinds of concerns show up in actual use? As an example, consider the box-box contact test. To avoid a book-length post, I’ll omit the generated assembly.

Given that manual inlining isn’t exactly a viable option in most cases, v2 usually uses static functions with in/out parameters. As expected, the generated code looks similar to the microbenchmark with the same kind of function usage. Here’s a VTune snapshot of the results:

instructionbottleneck.png

The CPI isn’t horrible, but most of the bottleneck is related to the loading instructions. The above breaks out the 37.4% of cycles which are stalled on front-end bottlenecks. The instruction cache misses and delivery inefficiencies become relevant when there are no memory bottlenecks to hide them. With deeper analysis, many moves and loads/stores could be eliminated and this could get a nice boost.

Another fun note, from the header of BoxPairTester.Test when inlining the function is disabled:

mov         ecx,2AAh  
xor         eax,eax  
rep stos    dword ptr [rdi] 

CoreCLR aggressively clears locally allocated variables if the IL locals init flag is set. Given that the flag is almost always set, it’s possible to spend a lot of time pointlessly zeroing memory. Here, the rep stos instruction performs 2AAh = 682 iterations. Each iteration sets 4 bytes of memory to the value of just-zeroed eax register, so this zeroes out 2728 bytes of stack space every single time the function is called.

In practice, many such clears are amortized over multiple function calls by forcing inlining, but unless the locals init flag is stripped, they’ll still happen. When compiled under ReleaseStrip configuration, v2 uses a post-build step to strip the locals init flag (and in the future there will likely be other options). Some simulations can improve by over 10% with the clearing stripped.

Summary

If you’re writing the kind of code where the generated assembly quality actually matters and isn’t bottlenecked by something else like memory, you should probably sanity test the performance occasionally or peek at the generated assembly to check things out. The JIT is improving, but there are limits to how much deep analysis can be performed on the fly without interfering with user experience.

And if you’re trying to use preview features that are still under active development, well, you probably know what you’re getting into.

A different kind of C#: the JIT doesn't really need demonic empowerment

It would be a little silly to write a series on performance in C# without mentioning the Just-In-Time (JIT) compiler. Unlike an offline toolchain that precompiles assemblies for specific platforms ahead of time (AOT), many C# applications compile on demand on the end user's device. While this does theoretically give a JIT more knowledge about the target system, it also constrains how much time is available to compile. Most users won't tolerate a 45 second startup time even if it does make everything run 30% faster afterwards. 

It's worth mentioning that there are AOT compilation paths, and some platforms require AOT. Mono has historically provided such a path, .NET Native is used for UWP apps, and the newer CoreRT is moving along steadily. AOT does not always imply deep offline optimization, but the relaxation of time constraints at least theoretically helps. There's also ongoing work on tiered compilation which could eventually lead to higher optimization tiers. 

One common concern is that running through any of today's JIT-quality compilers will result in inferior optimizations that render C# a dead end for truly high performance code. It's definitely true that the JIT is not able to optimize as deeply as an offline process, and this can show up in a variety of use cases.

But before diving into that, I would like to point out some important context. Consider the following simulation, a modified version of the ClothLatticeDemo. It's 65536 spheres connected by 260610 ball socket joints plus any collision related constraints that occur on impact with the table-ball-thing.

clothlattice256x256.png

On my 3770K, it runs at about 30 ms per frame prior to impact, and about 45 ms per frame after impact. The vast majority of that time is spent in the solver executing code that looks like this (from BallSocket.cs):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Solve(ref BodyVelocities velocityA, ref BodyVelocities velocityB, ref BallSocketProjection projection, ref Vector3Wide accumulatedImpulse)
{
    Vector3Wide.Subtract(velocityA.Linear, velocityB.Linear, out var csv);
    Vector3Wide.CrossWithoutOverlap(velocityA.Angular, projection.OffsetA, out var angularCSV);
    Vector3Wide.Add(csv, angularCSV, out csv);
    Vector3Wide.CrossWithoutOverlap(projection.OffsetB, velocityB.Angular, out angularCSV);
    Vector3Wide.Add(csv, angularCSV, out csv);
    Vector3Wide.Subtract(projection.BiasVelocity, csv, out csv);

    Symmetric3x3Wide.TransformWithoutOverlap(csv, projection.EffectiveMass, out var csi);
    Vector3Wide.Scale(accumulatedImpulse, projection.SoftnessImpulseScale, out var softness);
    Vector3Wide.Subtract(csi, softness, out csi);
    Vector3Wide.Add(accumulatedImpulse, csi, out accumulatedImpulse);

    ApplyImpulse(ref velocityA, ref velocityB, ref projection, ref csi);
}

It's a whole bunch of math in pretty tight loops. Exactly the kind of situation where you might expect a better optimizer to provide significant wins. And without spoiling much, I can tell you that the JIT could do better with the generated assembly here.

Now imagine someone at Microsoft (or maybe even you, it's open source after all!) receives supernatural knowledge in a fever dream and trades their soul to empower RyuJIT. Perversely blessed by the unfathomable darkness below, RyuJIT v666 somehow makes your CPU execute all instructions in 0 cycles. Instructions acting only upon registers are completely free with infinite throughput, and the only remaining cost is waiting on data from cache and memory.

How much faster would this simulation run on my 3770K when compiled with RyuJIT v666?

Take a moment and make a guess.

Infinitely faster can be ruled out- even L1 cache wouldn't be able to keep with with this demonically empowered CPU. But maybe the cost would drop from 45 milliseconds to 1 millisecond? Maybe 5 milliseconds?

From VTune:

clothlatticememorybandwidth.png

The maximum effective bandwidth of the 3770K in the measured system is about 23 GBps. Prior to impact, the simulation is consuming 18-19 GBps of that. Post-impact, it hovers around 15 GBps, somewhat reduced thanks to the time spent in the less bandwidth heavy collision detection phase. (There's also a bandwidth usage dip hidden by that popup box that corresponds to increased bookkeeping when all the collisions are being created, but it levels back out to around 15 GBps pretty quickly.)

If we assume that the only bottleneck is memory bandwidth, the speedup is at most about 1.25x before impact, and 1.55x after. In other words, the frame times would drop to no better than 24-30 ms. Realistically, stalls caused by memory latency would prevent those ideal speedups from being achieved.

RyuJIT v666, and by extension all earthly optimizing compilers, can't speed this simulation up much. Even if I rewrote it all in C it would be unwise to expect more than a few percent. Further, given that compute improves faster than memory in most cases, newer processors will tend to benefit even less from demons.

Of course, not every simulation is quite so memory bandwidth bound. Simulations that involve complex colliders like meshes will tend to have more room for magic compilers to work. It just won't ever be that impressive.

So, could the JIT-generated assembly be better? Absolutely, and it is getting better, rapidly. Could there sometimes be serious issues for very specific kinds of code, particularly when unbound by memory bottlenecks? Yes.

But is it good enough to create many complex speedy applications? Yup!

A different kind of C#: value types

Idiomatic C# tends to be object oriented and reliant on the garbage collector. That style can make modelling certain kinds of application logic easier, but as soon as performance is a requirement, strict OOP is a path to suffering. As this is a property of the underlying memory access patterns and the limitations of hardware, this applies to deeply OOP implementations in any language- even C++.

I won't go into a full digression on data oriented design, but the short version is that memory is extremely slow compared to everything else that a CPU is doing. In the time it takes to retrieve a piece of data from main memory, a single CPU core can execute hundreds of instructions. That stall can happen every single time a new piece of memory is touched. Large idiomatic object oriented applications tend to form an enormous web of references which, when followed, require incoherent and often uncached memory accesses. It doesn't matter how good your compiler is, that's going to leave the CPU mostly idle.

Here's an example from BEPUphysics v1, a demo of a bunch of boxes connected together to behave like stiff cloth:

v1clothpicture.png
v1drambound.png
v1latencybound.png

Before we get into how bad this is, let me advocate for v1 a little bit here:

  • I disabled multithreading for this test, so all stalls remain unfilled by other thread work.
  • This simulation is pretty large (at least for v1) at 14400 dynamic entities and 84734 constraints, so not everything will fit into the CPU caches even under ideal circumstances.
  • I intentionally evicted everything from caches between updates. That might happen in some applications that need to do a bunch of non-physics work, but not all of them.

If you're not familiar with Intel VTune (which has a free license now!), 'CPI' refers to cycles per instruction and 'DRAM Bound' refers to the percentage of cycles which are stuck waiting on main memory requests. Breaking out DRAM Bound shows memory bandwidth versus memory latency bottlenecks.

Okay, how bad is this?

  •  3-5 CPI means the CPU is doing very, very little besides waiting. Note that an ideal CPI is not 1; modern processors can execute more than one instruction per cycle in most relevant cases (see reciprocal throughput numbers on Agner Fog's always-useful instruction tables).
  • While not shown above, it's worth mentioning that BEPUphysics v1 was built pre-vectorization support in the runtime, so in addition to executing <<1 instruction per cycle, those instructions work with at most one scalar value at a time. This particular simulation uses about 0.4% of the 3770K's floating point throughput. To be fair, reaching even 10% can be tricky in highly optimized complex workloads, but 0.4% is just... not good.
  • A full 82% of cycles in the core solving function are stuck waiting on memory requests that the prefetcher could not predict, and which were not already in any cache. These requests take the form of body velocity, inertia, and constraint data, and in v1, all of them involve randomized memory accesses.
  • UpdateSolverActivity is mostly bottlenecked by total bandwidth rather than latency. It can be tempting to look at a bandwidth bottleneck and shrug helplessly- after all, you need a certain amount of data to compute what you need, there's not much left to do, right? But the memory subsystem of a CPU doesn't work with 4 or 12 bytes in isolation, it works with cache lines which span 64 bytes (on AMD/Intel/many other CPUs). If you ask for a boolean flag all by itself, the CPU will also pull down the surrounding cache line. If the data you truly want is sparsely distributed, the cache line utilization will be terrible and bandwidth required to serve all memory requests will be vastly higher than a tight packing.
  • Note that the solver executes multiple iterations each frame. The second iteration and beyond will benefit from the first iteration pulling in a bunch of the required memory into L3 cache. In this case, the simulation is too big to fit all at once, but it does hold a decent chunk. If the simulation was larger or the cache got otherwise evicted between iterations, the above numbers would be even worse.

The core issue behind all of these is pulling memory from a bunch of unpredictable places. This arises mostly from v1's OOP-y design. For example, the solver stores an array of all constraints:

RawList<SolverUpdateable> solverUpdateables = new RawList<SolverUpdateable>();

But every SolverUpdateable is a reference type:

public abstract class SolverUpdateable

An array of reference types is actually an array of pointers. Outside of rare ideal cases, it's unlikely that the data pointed to by these references will be in order and contiguous. No matter how this array is traversed, the accesses will still jump all over memory and suffer tons of cache misses.

(Side note: the v1 solver also intentionally randomizes solver order for the sake of avoiding some pathological constraint ordering which causes even more cache misses. OOPiness can't be blamed for that, but it's still something that v2 stopped doing.)

There's a lot of other subtle stuff going on here too, but the single largest takeway is that we need a way to express memory in a contiguous, structured way. Fortunately, this feature has existed since the primordial era of C#: value types

With value types, you can create a contiguous block of memory with values lined up in order. Take the following code as an example:

struct Snoot
{
    public int A;
    public float B;
    public long C;
}

public static void Boop()
{
    var snoots = new Snoot[4];
    for (int i = 0; i < snoots.Length; ++i)
    {
        ref var snoot = ref snoots[i];
        var value = i + 1;
        snoot.A = value;
        snoot.B = value;
        snoot.C = value;
    }

    ref var snootBytes = ref Unsafe.As<Snoot, byte>(ref snoots[0]);
    for (int i = 0; i < snoots.Length; ++i)
    {
        Console.Write($"Snoot {i} bytes: ");
        for (int j = 0; j < Unsafe.SizeOf<Snoot>(); ++j)
        {
            Console.Write($"{Unsafe.Add(ref snootBytes, i * Unsafe.SizeOf<Snoot>() + j)}, ");
        }
        Console.WriteLine();
    }
}

Executing Boop produces:

Snoot 0 bytes: 1, 0, 0, 0, 0, 0, 128, 63, 1, 0, 0, 0, 0, 0, 0, 0,
Snoot 1 bytes: 2, 0, 0, 0, 0, 0, 0, 64, 2, 0, 0, 0, 0, 0, 0, 0,
Snoot 2 bytes: 3, 0, 0, 0, 0, 0, 64, 64, 3, 0, 0, 0, 0, 0, 0, 0,
Snoot 3 bytes: 4, 0, 0, 0, 0, 0, 128, 64, 4, 0, 0, 0, 0, 0, 0, 0,

Walking sequentially through the array, we can directly observe the byte values that make up the Snoot instances. There are no indirections that need to be tracked down.

So, if a traditional reference-heavy object oriented memory model isn't great, what is an alternative? BEPUphysics v2 shows one possibility. Using the solver as an example again, here is how v2 represents a group of constraints:

public struct TypeBatch
{
    public RawBuffer BodyReferences;
    public RawBuffer PrestepData;
    public RawBuffer AccumulatedImpulses;
    public RawBuffer Projection;
    public Buffer<int> IndexToHandle;
    public int ConstraintCount;
    public int TypeId;
}

An important note here is that the properties of a constraint- body references, prestep data, and so on- are split into different arrays. Not every stage of constraint processing needs to access every constraint property. For example, solve iterations do not need PrestepData at all; trying to bundle prestep data with the rest of the properties would just waste valuable space in cache lines during the solve. That helps with memory bandwidth.

There's no processing logic in the TypeBatch at all, though. It's raw data. Untyped, even- the RawBuffer just refers to a block of bytes. How is it used?

Each TypeBatch contains only one type of constraint, as the name may imply. TypeBatches are... processed... by a TypeProcessor. TypeProcessors are built to understand a single kind of constraint and have no state of their own; they could easily be implemented as function pointers of static functions. In the solver, you'll find:

public TypeProcessor[] TypeProcessors;

When the solver encounters a TypeBatch with a TypeId of 3, it can just do a quick array lookup to find the TypeProcessor which knows what to do with that TypeBatch's data.

(Side note: the cost of each virtual call is amortized over an entire type batch. Executing a type batch will tend to take hundreds of nanoseconds at an absolute minimum, so adding 2-4 nanoseconds for an indirection is pretty much irrelevant. Compare this to the OOPy approach of having a bunch of SolverUpdateable references in an array and doing a virtual call on every individual instance. Not a big deal compared to data cache misses, but still pointless overhead that can be avoided.)

In the end, this enables blasting through contiguous arrays of data with tight loops of code that look like this:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Solve(ref BodyVelocities velocityA, ref BodyVelocities velocityB, ref AngularHingeProjection projection, ref Vector2Wide accumulatedImpulse)
{
    Vector3Wide.Subtract(velocityA.Angular, velocityB.Angular, out var difference);
    Matrix2x3Wide.TransformByTransposeWithoutOverlap(difference, projection.VelocityToImpulseA, out var csi);
    Vector2Wide.Scale(accumulatedImpulse, projection.SoftnessImpulseScale, out var softnessContribution);
    Vector2Wide.Add(softnessContribution, csi, out csi);
    Vector2Wide.Subtract(projection.BiasImpulse, csi, out csi);
    Vector2Wide.Add(accumulatedImpulse, csi, out accumulatedImpulse);
    ApplyImpulse(ref velocityA.Angular, ref velocityB.Angular, ref projection, ref csi);
}

It's pure math with no branches and no constraint cache misses. While GPUs get the spotlight when it comes to compute throughput, CPUs are still extremely good at this, especially with instruction level parallelism. Every single operation in the above is executed on several constraints at the same time using SIMD instructions (which I'll cover in more detail later).

There are some cache misses hidden in that chunk of code, but they are fundamentally required by the nature of the solver algorithm- body velocities can't be stored alongside constraint data because there is no one to one mapping between them, so they must be incoherently gathered and scattered. (Side note: these cache misses can be partially mitigated; the library tries to group constraints by access patterns.)

How much of a difference does this make? Here's some more vtune data, this time of a comparable v2 simulation. Same processor, multithreading still disabled, cache still evicted between every frame.

v2cpiprettygood.png

0.454 cycles per instruction is a wee bit better than the 5+ we observed in v1. Main memory latency is pretty much eliminated as a problem. And on top of that, just about every one of those instructions is operating on multiple lanes of data with SIMD. This is a big part of why v2 is often an order of magnitude faster than v1.

Summary

Value types are not new to C# and certainly not new to programming in general, but controlling memory access patterns is absolutely critical for performance. Without this, all other attempts at optimization would be practically irrelevant.

The next step will be to see how painless we can make this style of programming in modern C#.

A different kind of C#: the BEPUphysics v2 perspective

In the last couple of decades, managed languages have come to dominate a huge chunk of application development with a solid mix of high productivity and good-enough performance. Nowadays, you need a pretty strong justification to jump to a 'lower level' language, and that justification almost always has something to do with performance.

High frequency trading firms wouldn't want a garbage collection invocation stalling a trade by several milliseconds. Game developers working on the bleeding edge wouldn't want to sacrifice a feature or FPS target to compensate for a just-in-time compiler's lack of optimizations. It makes sense for those sorts of projects to work with a mature performance-focused ecosystem like the one around C or C++.

And then there's BEPUphysics v2- a super speedy physics simulation library built to take advantage of modern CPUs and routinely outperforms its predecessor by a factor of 10 or more, often pushing against fundamental limitations in current architectures.

And it's written from the ground up in C#, a managed language with a garbage collector and just-in-time compilation.

This is odd.

The goal of this series is to answer how this came to be, what parts of the language and runtime have enabled it, and how the future might look. Everything here should be considered a snapshot- readers a year from now should assume many things have changed. (And I might be unaware of some relevant work in the runtime right now- it's moving quickly.)

This will not be a language comparison or physics library comparsion (well, apart from bepuphysics v1 and v2). The goal is to explain a pleasantly surprising rapid evolution of complementary technologies through the lens of bepuphysics v2's development.

  1. Value types

  2. The JIT doesn't really need demonic empowerment

  3. Watching your step

  4. More to come...

v1.5.1 Versus v2.0.0-alpha: Early Benchmarks

It's time for benchmarks!

Historically, performance improvements in BEPUphysics v1 came incrementally. Each new big version came with a 20% boost here, another 10% over there, gradually accumulating to make v1.5.1 pretty speedy.

The jump from v1.5.1 to v2.0.0, even in its alpha state, is not incremental.

That's a change from about 120 ms per frame in v1 to around 9.5 ms in v2 when using 8 threads. The difference is large enough that the graph gets a little difficult to interpret, so the remaining graphs in this post will just show the speedup factor from v1 to v2.

Benchmarks

The benchmarks cover four test cases:
  1. Cloth lattice composed of 16384 spheres and 64470 ball socket constraints. Deactivation disabled.
  2. 20 pyramids, each containing 210 boxes. Deactivation disabled.
  3. Lots of statics with inactive bodies resting on top. 4096 inactive bodies, 9216 statics.
  4. Collapsing pile of boxes, capsules, and spheres. 4096 total bodies. Deactivation disabled.

If you'd like to run the benchmarks yourself, grab the benchmarks source from github: github.com/RossNordby/scratchpad/tree/master/Benchmarks

I bugged a bunch of people with different processors to run the benchmarks. Some of these benchmarks were recorded with minor background tasks, so it's a fairly real world sample. All benchmarks ran on Windows 10 with .NET Core 2.0.6.

Results

First, two processors that use 128 bit SIMD:

ShapePile and Pyramids both require quite a bit of narrow phase work which involves a chunk of scalar bookkeeping. With fewer opportunities for vectorization, they don't benefit as much as the extremely solver heavy ClothLattice benchmark.

Note that the ClothLattice benchmark tends to allow v1 to catch up a little bit at higher core counts. This is largely because of limited memory bandwidth. The solver is completely vectorized apart from the bare minimum of gathers and scatters, so it's difficult to fetch constraints from memory as quickly as the cores can complete the ALU work. v1, in contrast, was just a giant mess of cache misses, and it's very easy for cores to idle in parallel.

LotsOfStatics is another interesting case: statics and inactive bodies are placed into an extremely low cost state compared to v1. In fact, the only stage that is aware of those objects at all is the broad phase, and the broad phase has a dedicated structure for inactive bodies and statics. In terms of absolute performance, the LotsOfStatics benchmark took under 350 microseconds per frame with 8 threads on the 3770K.

Notably, this is an area where v2 still has room for improvement. The broad phase is far too aggressive about refining the static structure; I just haven't gotten around to improving the way it schedules refinements yet. This particular benchmark could improve by another factor of 2-4 easily.

Now for a few AVX2-capable processors:

There are some pretty interesting things going on here. Most noticeably, the 7700HQ shows a much larger improvement across the board than any of the other processors tested. This is expected when comparing against the 128-wide platforms (3770K and 920) thanks to the significantly higher floating point throughput. The ClothLattice demo in particular shows a speedup of 10-15x compared to the 3770K's 7.4-11.9x.

While I haven't verified this with direct measurement, I suspect the 7700HQ benefits more than the AVX2 capable 4790K and 1700X thanks to its significantly higher effective memory bandwidth per core. The 4790K roughly matches the non-AVX2 3770K in bandwidth, and 1700X actually has significantly less bandwidth per core than the 3770K. The memory bandwidth hungry ClothLattice benchmark timings are consistent with this explanation- both the 4790K and 1700X show noticeably less benefit with higher core counts compared to the 7700HQ.

Zen's AVX2 implementation also has slightly different performance characteristics. I have not done enough testing to know how much this matters. If RyuJIT is generating a bunch of FMA instructions that I've missed, that could contribute to the 7700HQ's larger improvement.

There also a peculiar massive improvement in the LotsOfStatics demo, topping out at 42.9x faster on the 7700HQ. That's not something I expected to see- that benchmark spends most of its time in weakly vectorized broadphase refinements. I haven't yet done a deep dive on refinement performance (it's slated to be reworked anyway), but this could be another area where the 7700HQ's greater bandwidth is helpful.

Summary

v2 is vastly faster than v1. It's not just a way to save money on server hardware or to allow for a few more bits of cosmetic debris- it's so much faster that it can be used to make something qualitatively different.

And it's going to get faster!

BEPUphysics v1.5.0 now on GitHub!

I decided to move the demos over to monogame after many years of procrastination, then looked back on all the commits and said, hey, I might as well package these up for a release.

Then codeplex was temporarily refusing my pushes, so I said, hey, might as well put it on github.

Check out the newness on github!

(Certain observers may note that BEPUphysics v2 is, in fact, not yet out, nor is it even being actively developed yet. Don't worry, it's not dead or anything, I just have to get this other semirelated project out of the way. I'm hoping to get properly started on v2 in very early 2017, but do remember how good I am at estimating timelines.)

Blazing Fast Trees

The prototype for the first big piece of BEPUphysics v2.0.0 is pretty much done: a tree.

This tree will (eventually) replace all the existing trees in BEPUphysics and act as the foundation of the new broad phase.

So how does the current prototype compare with v1.4.0's broad phase?

It's a lot faster.

The measured 'realistic' test scene includes 65536 randomly positioned cubic leaves ranging from 1 to 100 units across, with leaf size given by 1 + 99 * X^10, where X is a uniform random value from 0 to 1. In other words, there are lots of smaller objects and a few big objects, and the average size is 10 units. All leaves are moving in random directions with speeds given by 10 * X^10, where X is a uniform random value from 0 to 1, and they bounce off the predefined world bounds (a large cube) so that they stay in the same volume. The number of overlaps ranges between 65600 and 66300.

Both simulations are multithreaded with 8 threads on a 3770K@4.5ghz. Notably, the benchmarking environment was not totally clean. The small spikes visible in the new implementation do not persist between runs and are just the other programs occasionally interfering.

So, the first obvious thing you might notice is that the old version spikes like crazy. Those spikes were a driving force behind this whole rewrite. What's causing them, and how bad can they get?

 

The answers are refinement and really bad. Each one of those spikes represents a reconstruction of part of the tree which has expanded beyond its optimal size. Those reconstructions aren't cheap, and more importantly, they are unbounded. If a reconstruction starts near the root, it may force a reconstruction of large fractions of the tree. If you're really unlucky, it will be so close to the root that the main thread has to do it. In the worst case, the root itself might get reconstructed- see that spike on frame 0? The graph is actually cut off; it took 108ms. While a full root reconstruction usually only happens on the first frame, the other reconstructions are clearly bad enough. These are multi-frame spikes that a user can definitely notice if they're paying attention. Imagine how that would feel in VR.

To be fair to the old broad phase, this test is a lot more painful than most simulations. The continuous divergent motion nearly maximizes the amount of reconstruction required. 

But there's something else going on, and it might be even worse. Notice that slow upward slope in the first graph? The new version doesn't have it at all, so it's not a property of the scene itself. What does the tree quality look like?

 

This graph represents the computed cost of the tree. If you've heard of surface area heuristic tree builders in raytracing, this is basically the same thing except the minimized metric is volume instead of surface area. (Volume queries and self collision tests have probability of overlap proportional to volume, ray-AABB intersection probability is proportional to surface area. They usually produce pretty similar trees, though.)

The new tree starts with poor quality since the tree was built using incremental insertion, but the new refinement process quickly reduces cost. It gets to around 37.2, compared to a full sweep rebuild of around 31.9.

The old tree starts out better since the first frame's root reconstruction does a full median split build. But what happens afterward? That doesn't look good. What happens if tree churns faster? How about a bunch of objects moving 10-100 instead of 0-10 units per second, with the same distribution?

 

 

Uh oh. The cost increases pretty quickly, and the self test cost rises in step. By the end, the new version is well over 10 times as fast. As you might expect, faster leaf speeds are even worse. I neglected to fully benchmark that since a cost metric 10000 times higher than it should be slows things down a little.

What's happening?

The old tree reconstructs nodes when their volume goes above a certain threshold. After the reconstruction, a new threshold is computed based on the result of the reconstruction. Unfortunately, that new threshold lets the tree degrade further next time around. Eventually, the threshold ratchets high enough that very few meaningful refinements occur. Note in the graph that the big refinement time spikes are mostly gone after frame 1000. If enough objects are moving chaotically for long periods of time, this problem could show up in a real game.

This poses a particularly large problem for long-running simulations like those on a persistent game server. The good news is that the new version has no such problem, the bad news is that there is no good workaround for the old version. For now, if you run into this problem, try periodically calling DynamicHierarchy.ForceRebuild (or look for the internal ForceRevalidation in older versions). As the name implies, it will reset the tree quality but at a hefty price. Expect to drop multiple frames.

(This failure is blindingly obvious in hindsight, and I don't know how I missed it when designing it, benchmarking it, or using it. I'm also surprised no one's reported it to my knowledge. Oops!)

So, how about if nothing is moving?

 

The old version manages to maintain a constant slope, though it still has some nasty spikes. Interestingly, those aren't primarily from refinement, as we'll see in a moment.

This is also a less favorable comparison for the new tree, "only" being 3 times as fast.

Splitting the time contributions helps explain both observations:
  

The old version's spikes can't be reconstructions given that everything is totally stationary, and the self test shows them too. I didn't bother fully investigating this, but one possible source is poor load balancing. It uses a fairly blind work collector, making it very easy to end up with one thread overworked. The new version, in contrast, is smarter about selecting subtasks of similar size and also collects more of them.

So why is the new refinement only a little bit faster if the self test is 3.5 times faster? Two reasons. First, the new refinement is never satisfied with doing no work, so in this kind of situation it does a bit too much. Second, I just haven't spent much time optimizing the refinement blocks for low work situations like this. These blocks are fairly large compared to the needs of a totally stationary tree, so very few of them need to be dispatched. In this case, there were only 2. The other threads sit idle during that particular subphase. In other words, the new tree is currently tuned for harder workloads.

Now, keeping leaves stationary, what happens when the density of leaves is varied? First, a sparse distribution with 8 times the volume (and so about one eighth the overlaps):

 

A bit over twice as fast. A little disappointing, but this is another one of those 'easy' cases where the new refinement implementation doesn't really adapt to very small workloads, providing marginal speedups.

How about the reverse? 64 times more dense than the above, with almost 500000 overlaps. With about 8 overlaps per leaf, this is roughly the density of a loose pile.

 

Despite the fact that the refinement suffers from the same 'easy simulation' issue, the massive improvement in test times brings the total speedup to over 5 times faster. The new tree's refinement takes less than a millisecond on both the sparse and dense cases, but the dense case stresses the self test vastly more. And the old tree is nowhere near as fast at collision tests.

Next up: while maintaining the same medium density of leaves (about one overlap per leaf), vary the number. Leaves are moving at the usual 0-10 speed again for these tests.  First, a mere 16384 leaves instead of 65536:

Only about 2.5 times faster near the end. The split timings are interesting, though: 

The self test marches along at around 3.5 times as fast near the end, but the refinement is actually slower... if you ignore the enormous spikes of the old version. Once again, there's just not enough work to do and the work chunks are too big at the moment. 400 microseconds pretty okay, though.

How about a very high leaf count, say, 262144 leaves? 

Around 4 times as fast. Refinement has enough to chomp on.

Refinement alone hangs around 2.5-2.75 times as fast, which is pretty fancy considering how much more work it's doing. As usual, the self test is super speedy, only occasionally dropping below 4.20 times as fast.

How about multithreaded scaling? I haven't investigated higher core counts yet, but here are the new tree's results for single threaded versus full threads on the 3770K under the original 65536 'realistic' case:

 

Very close to exactly 4 times as fast total. Self tests float around 4.5 times faster. As described earlier, this kind of 'easy' simulation results in a fairly low scaling in refinement- only about 2.3 times faster. If everything was flying around at higher speeds, refinement would be stressed more and more work would be available.

For completeness, here's the new tree versus the old tree, singlethreaded, in the same simulation:

 

3 times faster refines (ignoring spikes), and about 4.5 faster in general.

How does it work?

The biggest conceptual change is the new refinement phase. It has three subphases:

1) Refit

As objects move, the node bounds must adapt. Rather than doing a full tree reconstruction every frame, the node bounds are recursively updated to contain all their children.

During the refit traversal, two additional pieces of information are collected. First, nodes with a child leaf count below a given threshold are added to 'refinement candidates' set. These candidates are the roots of a bunch of parallel subtrees. Second, the change in volume of every node is computed. The sum of every node's change in volume divided by the root's volume provides the change in the cost metric of the tree for this frame.

2) Binned Refine

A subset of the refinement candidates collected by the refit traversal are selected. The number of selected candidates is based on the refit's computed change in cost; a bigger increase means more refinements. The frame index is used to select different refinement candidates as time progresses, guaranteeing that the whole tree eventually gets touched.

The root always gets added as a refinement target. However, the refinement is bounded. All of these refinements tend to be pretty small. Currently, any individual refinement in a tree with 65536 leaves will collect no more than 768 subtrees, a little over 1%. That's why there are no spikes in performance.

Here's an example of candidates and targets in a tree with 24 leaves:

The number within each node is the number of leaves in the children of that node. Green circles are leaf nodes, purple circles are refinement candidates that weren't picked, and red circles are the selected refinement targets. In this case, the maximum number of subtrees for any refinement was chosen as 8.

Since the root has so many potential nodes available, it has options about which nodes to refine. Rather than just diving down the tree a fixed depth, it seeks out the largest nodes by volume. Typically, large nodes tend to be a high leverage place to spend refine time. Consider a leaf node that's moved far enough from its original position that it should be in a far part of the tree. Its parents will tend to have very large bounds, and refinement will see that.

For multithreading, refinement targets are marked (only the refinement treelet root, though- no need to mark every involved node). Refinement node collection will avoid collecting nodes beyond any marked node, allowing refinements to proceed in parallel.

The actual process applied to each refinement target is just a straightforward binned builder that operates on the collected nodes. (For more about binned builders, look up "On fast Construction of SAH-based Bounding Volume Hierarchies" by Ingo Wald.)

3) Cache Optimize 

The old tree allocated nodes as reference types and left them scattered through memory. Traversing the tree was essentially a series of guaranteed cache misses. This is not ideal.

The new tree is just a single contiguous array. While adding/removing elements and binned refinements can scramble the memory order relative to tree traversal order, it's possible to cheaply walk through parts of the tree and shuffle nodes around so that they're in the correct relative positions. A good result only requires optimizing a fraction of the tree; 3% to 5% works quite well when things aren't moving crazy fast. The fraction of cache optimized nodes scales with refit-computed cost change as well, so it compensates for the extra scrambling effects of refinement. In most cases, the tree will sit at 80-95% of cache optimal. (Trees with only a few nodes, say less than 4096, will tend to have a harder time keeping up right now, but they take microseconds anyway.)

Cache optimization can double performance all by itself, so it's one of the most important improvements.

As for the self test phase that comes after refinement, it's pretty much identical to the old version in concept. It's just made vastly faster by a superior node memory layout, cache friendliness, greater attention to tiny bits, and no virtual calls. 

Interestingly, SIMD isn't a huge part of the speedup. It's used here and there (mainly refit), but not to its full potential. The self test in particular, despite being the dominant cost, doesn't use SIMD at all. 

Future work

1) Solving the refinement scaling issue for 'easy' simulations would be nice.

2) SIMD is a big potential area for improvement. As mentioned, this tree is mostly scalar in nature. At best, refit gets decent use of 3-wide operations. My attempts at creating fully vectorized variants tended to do significantly better than the old one, but they incurred too much overhead in many phases and couldn't beat the mostly scalar new version. I'll probably fiddle with it some more when a few more SIMD instructions are exposed, like shuffles; it should be possible to get at least another 1.5 to 2 times speedup.

3) Refinement currently does some unnecessary work on all the non-root treelets. They actually use the same sort of priority queue selection, even though they are guaranteed to eat the whole subtree by the refinement candidate collection threshold. Further, it should be possible to improve the node collection within refinement by taking into account the change in subtree volume on a per-node level. The root refinement would seek out high entropy parts of the tree. Some early testing implied this would help, but I removed it due to memory layout conflicts. 

4) I suspect there are some other good options for the choice of refinement algorithm. I already briefly tried agglomerative and sweep refiners (which were too slow relative to their quality advantage), but I didn't get around to trying things like brute forcing small treelet optimization (something like "Fast Parallel Construction of High-Quality Bounding Volume Hierarchies"). I might revisit this when setting up the systems of the next point.

5) It should be possible to improve the cache optimization distribution. Right now, the multithreaded version is forced into a suboptimal optimization order and suffers from overhead introduced by lots of atomic operations. Some experiments with linking cache optimization to the subtrees being refined showed promise. It converged with little effort, but it couldn't handle the scrambling effect of root refinement. I think this is solvable, maybe in combination with #4.

6) Most importantly, all of the above assumes a bunch of dynamic leaves. Most simulations have tons of static or inactive objects. The benchmarks show that the new tree doesn't do a bad job on these by any means, but imagine all the leaves were static meshes. There's no point in being aggressive with refinements or cache optimizations because nothing is moving or changing, and there's no need for any collision self testing if static-static collisions don't matter.

This is important because the number of static objects can be vastly larger than the number of dynamic objects. A scene big enough to have 5000 active dynamic objects might have hundreds of thousands of static/inactive objects. The old broad phase would just choke and die completely, requiring extra work to use a StaticGroup or something (which still wouldn't provide optimal performance for statics, and does nothing for inactive dynamics). In contrast, a new broad phase that has a dedicated static/inactive tree could very likely handle it with very little overhead.

When I have mentioned big planned broad phase speedups in the past ("over 10 times on some scenes"), this is primarily what I was referring to. The 4 times speedup of the core rewrite was just gravy.

Now what?

If you're feeling adventurous, you can grab the tree inside of the new scratchpad repository on github. Beware, it's extremely messy and not really packaged in any way. There are thousands of lines of dead code and diagnostics, a few dependencies are directly referenced .dlls rather than nice nuget packages, and there's no documentation. The project also contains some of the vectorized trees (with far fewer features) and some early vectorized solver prototyping. Everything but the Trees/SingleArray tree variant is fairly useless, but it might be interesting to someone.

In the future, the scratchpad repo will be where I dump incomplete code scribblings, mostly related to BEPUphysics.

I'm switching developmental gears to some graphics stuff that will use the new tree. It will likely get cleaned up over time and turned into a more usable form over the next few months. A proper BEPUphysics v2.0.0 repository will probably get created sometime in H1 2016, though it will remain incomplete for a while after that.

BEPUphysics in a CoreCLR World

A lot of exciting stuff has happened in the .NET world over the last year, and BEPUphysics is approaching some massive breaking changes. It seems like a good time to condense the plans in one spot.

First, expect v1.4.0 to get packaged up as a stable release in the next couple of months. At this time, I expect that v1.4.0 will likely be the last version designed with XNA platform compatibility in mind.

Following what seems to be every other open source project in existence, BEPUphysics will probably be moving to github after v1.4.0 is released.

Now for the fun stuff:


BEPUphysics v2.0.0

High Level Overview:

Performance drives almost everything in v2.0.0. Expect major revisions; many areas will undergo total rewrites. Applications may require significant changes to adapt. The revisions follow the spirit of the DX11/OpenGL to DX12/Vulkan shift. The engine will focus on providing the highest possible performance with a minimal API.

Expect the lowest level engine primitives like Entity to become much 'dumber', behaving more like simple opaque data blobs instead of a web of references, interfaces, and callbacks. The lowest layer will likely assume the user knows what they're doing. For example, expect a fundamental field like LinearVelocity to be exposed directly and without any automatic activation logic. "Safe" layers that limit access and provide validation may be built above this to give new users fewer ways to break everything.

Features designed for convenience will be implemented at a higher level explicitly separated from the core simulation or the responsibility will be punted to the user.

Some likely victims of this redesign include:
-Internal timestepping. There is really nothing special about internal timestepping- it's just one possible (and very simple) implementation of fixed timesteps that could, and probably should, be implemented externally.
-Space-resident state buffers and state interpolation. Users who need these things (for asynchronous updates or internal timestepping) have to opt in anyway, and there's no reason to have them baked into the engine core.
-All deferred collision events, and many immediate collision events. The important degrees of access will be retained to enable such things to be implemented externally, but the engine will do far less.
-'Prefab' entity types like Box, Sphere, and so on are redundant and only exist for legacy reasons. Related complicated inheritance hierarchies and generics to expose typed fields in collidables will also likely go away.
-'Fat' collision filtering. Some games can get by with no filtering, or just bitfields. The engine and API shouldn't be hauling around a bunch of pointless dictionaries for such use cases.
And more. 

Platform Support:

Expect older platforms like Xbox360 and WP7 to be abandoned. The primary target will be .NET Core. RyuJIT and the new SIMD-accelerated numeric types will be assumed. Given the new thriving open source initiative, I think this is a safe bet.

Going forward, expect the engine to adopt the latest language versions and platform updates more rapidly. The latest version of VS Community edition will be assumed. Backwards compatibility will be limited to snapshots, similar to how v1.4.0 will be a snapshot for the XNA-era platforms.

Areas of Focus:

1) Optimizing large simulations with many inactive or static objects

In v1.4.0 and before, a common recommendation is to avoid broadphase pollution. Every static object added to the Space is one more object to be dynamically handled  by the broad phase. To mitigate this issue, bundling many objects into parent objects like StaticGroups is recommended. However, StaticGroups require explicit effort, lack dynamic flexibility, and are not as efficient as they could be.

Inactive objects are also a form of broadphase pollution, but unlike static objects, they cannot be bundled into StaticGroups. Further, these inactive objects pollute most of the other stages. In some cases, the Solver may end up spending vastly more time testing activity states than actually solving anything.

Often, games with these sorts of simulations end up implementing some form of entity tracking to remove objects outside of player attention for performance reasons. While it works in many cases, it would be better to not have to do it at all.

Two large changes are required to address these problems:
-The BroadPhase will be aware of the properties of static and inactive objects. In the normal case, additional static or inactive objects will incur almost no overhead. (In other words, expect slightly less overhead than the StaticGroup incurs, while supporting inactive dynamic objects.)
-Deactivation will be redesigned. Persistent tracking of constraint graphs will be dropped in favor of incremental analysis of the active set, substantially reducing deactivation maintenance overhead. Stages will only consider the active set, rather than enumerating over all objects and checking activity after the fact.

On the type of simulations hamstrung by the current implementation, these changes could improve performance hugely. In extreme cases, a 10x speedup without considering the other implementation improvements or SIMD should be possible.

2) Wide parallel scaling for large server-style workloads

While the engine scales reasonably well up to around 4 to 6 physical cores, there remain sequential bottlenecks and lock-prone bits of code. The NarrowPhase's tracking of obsolete collision pairs is the worst sequential offender. More speculatively, the Solver's locking may be removed in favor of a batching model if some other changes pan out.

The end goal is decent scaling on 16-64 physical cores for large simulations, though fully achieving this will likely require some time.

3) SIMD

With RyuJIT's support for SIMD types comes an opportunity for some transformative performance improvements. However, the current implementation would not benefit significantly from simply swapping out the BEPUutilities types for the new accelerated types. Similarly, future offline optimizing/autovectorizing compilers don't have much to work with under the current design. As it is, these no-effort approaches would probably end up providing an incremental improvement of 10-50% depending on the simulation.

To achieve big throughput improvements, the engine needs cleaner data flow, and that means a big redesign. The solver is the most obvious example. Expect constraints to undergo unification and a shift in data layout. The Entity object's data layout will likely be affected by these changes. The BroadPhase will also benefit, though how much is still unclear since the broad phase is headed for a ground up rewrite.

The NarrowPhase is going to be the most difficult area to adapt; there are a lot of different collision detection routines with very complicated state. There aren't as many opportunities for unification, so it's going to be a long case-by-case struggle to extract as much performance as possible. The most common few collision types will most likely receive in-depth treatment, and the remainder will be addressed as required.

Miscellaneous Changes:

-The demos application will move off of XNA, eliminating the need for a XNA Game Studio install. The drawer will be rewritten, and will get a bit more efficient. Expect the new drawer to use DX11 (feature level 11_0) through SharpDX. Alternate rendering backends for OpenGL (or hopefully Vulkan, should platform and driver support be promising at the time) may be added later for use in cross platform debugging. 

-As alluded to previously, expect a new broad phase with a much smoother (and generally lower) runtime profile. Focuses on incremental refinement; final quality of tree may actually end up higher than the current 'offline' hierarchies offered by BEPUphysics.

-StaticGroup will likely disappear in favor of the BroadPhase just handling it automatically, but the non-BroadPhase hierarchies used by other types like the StaticMesh should still get upgraded to at least match the BroadPhase's quality.

-Collision pair handlers are a case study in inheritance hell. Expect something to happen here, but I'm not yet sure what.

-Wider use of more GC-friendly data structures like the QuickList/QuickSet to avoid garbage and heap complexity.

-Convex casts should use a proper swept test against the broad phase acceleration structure. Should make long unaligned casts much faster.

-More continuous collision detection options. Motion clamping CCD is not great for all situations- particularly systems of lots of dynamic objects, like passengers on a plane or spaceship. The existing speculative contacts implementation helps a little to stabilize things, but its powers are limited. Granting extra power to speculative contacts while limiting ghost collisions would be beneficial.

-The CompoundShape could use some better flexibility. The CompoundHelper is testament to how difficult it can be to do some things efficiently with it.

Schedule Goals:

Variable. Timetable depends heavily on what else is going on in development. Be very suspicious of all of these targets.

Expect the earliest changes to start showing up right after v1.4.0 is released. The first changes will likely be related the debug drawer rewrite.

The next chunk may be CCD/collision pair improvements and the deactivation/broadphase revamp for large simulations. The order of these things is uncertain at this time because there may turn out to be some architectural dependencies. This work will probably cover late spring to mid summer 2015.

Early attempts at parallelization improvements will probably show up next. Probably later in summer 2015.

SIMD work will likely begin at some time in late summer 2015. It may take a few months to adapt the Solver and BroadPhase.

The remaining miscellaneous changes, like gradual improvements to collision detection routines, will occur over the following months and into 2016. I believe all the big changes should be done by some time in spring 2016.

This work won't be contiguous; I'll be hopping around to other projects throughout.

Future Wishlist:

-The ancient FluidVolume, though slightly less gross than it once was, is still very gross. It would be nice to fix it once and for all. This would likely involve some generalizations to nonplanar water- most likely procedural surfaces that would be helpful in efficiently modeling waves, but maybe to simple dynamic heightfields if the jump is short enough.

-Fracture simulation. This has been on the list for a very long time, but there is still a chance it will come up. It probably won't do anything fancy like runtime carving or voronoi shattering. More likely, it will act on some future improved version of CompoundShapes, providing different kinds of simple stress simulation that respond to collisions and environmental effects to choose which parts get fractured. (This isn't a very complicated feature, and as mentioned elsewhere on the forum, I actually implemented something like it once before in a spaceship game prototype- it just wasn't quite as efficient or as clean as a proper release would require.)

On GPU Physics:

In the past, I've included various kinds of GPU acceleration on the development wishlist. However, now, I do not expect to release any GPU-accelerated rigid body physics systems in the foreseeable future. BEPUphysics itself will stay exclusively on the CPU for the foreseeable future.

I've revisited the question of GPU accelerated physics a few times over the last few years, including a few prototypes. However, GPU physics in games is still primarily in the realm of decoration. It's not impossible to use for game logic, but having all of the information directly accessible in main memory with no latency is just a lot easier. 

And implementing individually complicated objects like the CharacterController would be even more painful in the coherence-demanding world of GPUs. (I would not be surprised if a GPU version of a bunch of full-featured CharacterControllers actually ran slower due to the architectural mismatch.) There might be a hybrid approach somewhere in here, but the extra complexity is not attractive.

And CPUs can give pretty-darn-decent performance. BEPUphysics is already remarkably quick for how poorly it uses the capabilities of a modern CPU.

And our own game is not a great fit for GPU simulation, so we have no strong internal reason to pursue it. Everything interacts heavily with game logic, there are no deformable objects, there are no fluids, any cloth is well within the abilities of CPU physics, and the clients' GPUs are going to be busy making pretty pictures.

This all makes implementing runtime GPU simulation a bit of a hard sell.

That said, there's a small chance that I'll end up working on other types of GPU accelerated simulation. For example, one of the GPU prototypes was a content-time tool to simulate flesh and bone in a character to automatically generate vertex-bone weights and pose-specific morph targets. We ended up going another direction in the end, but it's conceivable that other forms of tooling (like BEPUik) could end up coming out of continued development.

 

Have some input? Concerned about future platform support? Want to discuss the upcoming changes? Post on the forum thread this was mirrored from, or just throw tweets at me.

BEPUphysics v1.3.0 released!

Grab the new version! Check out the new stuff!

With this new version comes some changes to the forks. I've dropped the SlimDX and SharpDX forks in favor of focusing on the dependency free main fork.

The XNA fork will stick around for now, but the XNA fork's library will use BEPUutilities math instead of XNA math. Going forward, the fork will be used to maintain project configuration and conditional compilation requirements for XNA platforms.

Pointers!

I recently posted a couple of suggestions on uservoice that would save me from some future maintenance nightmares and improve the performance and memory usage of BEPUphysics in the long term.

First, it would be really nice to have wide support for unsafe code so that I wouldn't have to maintain a bunch of separate code paths. One of the primary targets of BEPUphysics, Windows Phone, does not currently support it. It would be nice if it did! If you'd like to help, vote and share this suggestion:
http://wpdev.uservoice.com/forums/110705-dev-platform/suggestions/4715894-allow-unsafe-code

Second, generic pointers in C# would make dealing with low level memory management more modularizable and much less painful. I could do all sorts of fun optimizations in BEPUphysics without totally forking the codebase! Here's the related suggestion:
http://visualstudio.uservoice.com/forums/121579-visual-studio/suggestions/4716089-unmanaged-generic-type-constraint-generic-pointe

And, as always, if you want to see the above notifications and others in a slightly less readable format, there's my twitter account

Hey, where'd the dependency free version/XNA version go?

I've moved the Dependency Free fork into the main branch. Primary development from here on out will be on this dependency free version. Other forks will have changes merged in periodically (at least as often as major version releases).

If you're looking for the XNA version of BEPUphysics, head to the newly created XNA fork.

I still haven't gotten around to MonoGame-izing the BEPUphysicsDemos and BEPUphysicsDrawer, though, so they still rely on XNA.

BEPUphysics and XNA

As some of you may have noticed, there's been a recent uptick in the discussion of XNA's slow fading. As the main branch of BEPUphysics is still based on XNA, I think it would be appropriate to discuss the path of BEPUphysics.

For those of you who are not aware, BEPUphysics has long had multiple forks based on different libraries for easy math interoperation. The official forks include SlimDX, SharpDX, and the dependency free fork (and of course the main development fork, which is currently XNA). The dependency free fork has its own math systems and does not depend on anything beyond .NET or Mono's libraries.

As mentioned/hidden away in the version roadmap, one of the next two major packaged releases of BEPUphysics will move over to the dependency free fork. Already, you can see progress in this direction; the dependency free math utilities have been reorganized and expanded. Internally, we are already using the dependency free fork utilities for our projects.

Expect the swap to occur in the next six months as my procrastination is overridden by 1) internal development and 2) the fact that I haven't released a proper packaged version since May 2012. (The latter of which has lead to some confusion about whether development has halted- development continues and will continue!)

The most likely target frameworks for the rewritten demos will be either SharpDX or, for wider use, MonoGame.

There have also been some discussions about WinRT versions of BEPUphysics. WinRT is not a targeted platform for our internal projects, so it's a lower priority. However, there are a variety of little annoyances and API changes (particularly with threading) which make it nontrivial for people to pull it into WinRT projects; it would be nice to solve this in one spot so there isn't a bunch of redundant work being done. I will likely get around to it eventually- assuming someone else doesn't maintain a fork for it :) Anyone? :) :)   :)   :) :) :)      :) :) :)    <:)

 

Don't forget to follow me on twitter so you can see a notification on twitter about this blog post!