reublog

Deploying Wyam To GitHub Using Visual Studio Online

Reuben Bond — Tue, 03 Oct 2017 00:00:00 GMT

Here goes nothing! This blog is built with Dave Glick's Wyam static site generator and deployed from a git repo in Visual Studio Online to GitHub Pages. Here's how to set up something similar.

Prerequisites

A Visual Studio Online repository for your blog source.
- You could also have VSO pull the source from GitHub or somewhere else instead, but I haven't covered that here.
A GitHub repository which will serve the compiled output via GitHub Pages.
- I created a repository called reubenbond.github.io under my profile, ReubenBond.
Cake so you can test it out locally. Install it via Chocolatey: choco install cake.portable

Kick-starting Wyam with Cake

Create a file called build.cake in the root of your repo with these contents:

#tool nuget:?package=Wyam
#addin nuget:?package=Cake.Wyam

var target = Argument("target", "Default");

Task("Build")
   .Does(() =>
   {
       Wyam(new WyamSettings
       {
           Recipe = "Blog",
           Theme = "CleanBlog",
           UpdatePackages = true
       });
   });
   
Task("Preview")
   .Does(() =>
   {
       Wyam(new WyamSettings
       {
           Recipe = "Blog",
           Theme = "CleanBlog",
           UpdatePackages = true,
           Preview = true,
           Watch = true
       });        
   });

Task("Default")
   .IsDependentOn("Build");    
   
RunTarget(target);

Add a file called config.wyam like so:

#recipe Blog
#theme CleanBlog

Settings[Keys.Host] = "yourname.github.io";
Settings[BlogKeys.Title] = "MegaBlog";
Settings[BlogKeys.Description] = "Blog of the Gods";

Create a folder called input and add a folder called posts inside that. Now create input/posts/fist-post.md:

Title: Fist Post! A song of fice and ire
Published: 10/30/2017
Tags: ['Fists']
---

This post is about fists and how clumpy they always are.

Great! Try running it using Cake. Because Wyam targets an older version of Cake at the time of writing, I'm adding the --settings_skipverification=true option so that Cake doesn't complain.

cake --settings_skipverification=true -target=Preview

Open a browser to http://localhost:5080 and see the results. The Preview target watches for file changes so it can automatically recompile & refresh your browser whenever you save changes.

Automating Deployment

Install the Cake build task from the Visual Studio Marketplace into VSO.
In Visual Studio Online, create a new, empty build for your repo, selecting an appropriate build agent.
Add the Cake Build task.
Select the build.cake file from the root of your repo as the Cake Script.
Set the Target to Default.
Optionally add the --settings_skipverification=true option to Cake Arguments.
Add a new PowerShell Script build task, set Type to Inline Script and add these contents:

param (
  [string]$Token,
  [string]$UserName,
  [string]$Repository
)

$localFolder = "gh-pages"
$repo = "https://$($UserName):$($Token)@github.com/$($Repository).git"
git clone $repo --branch=master $localFolder

Copy-Item "output\*" $localFolder -recurse

Set-Location $localFolder
git add *
git commit -m "Update."
git push

Create a new GitHub Personal Access Token from GitHub's Developer Settings page. I added all of the repo permissions to the token.
In VSO, add arguments for the script, replacing TOKEN with your token and replacing the other values as appropriate:

-Token TOKEN -UserName "ReubenBond" -Repository "ReubenBond/reubenbond.github.io"

Up on the Triggers pane, enable Continuous Integration.
Click Save & queue, then cross your fingers.

Hopefully that's it and you can now add new blog posts to the input/posts directory.

Code Generation on .NET

Reuben Bond — Wed, 01 Nov 2017 00:00:00 GMT

This is the first part in what's hopefully a series of short posts covering code generation on the .NET platform.

Almost every .NET application relies on code generation in some form, usually because it relies on a library which generates code as a part of how it functions. Eg, Json.NET leverages code generation and so do ASP.NET, Entity Framework, Orleans, most serialization libraries, many dependency injection libraries, and probably every test mocking library.

Let's skip past why code generation is useful and jump straight into a high level overview of code generation technologies for .NET.

Kinds of Code Generation

The 3 code gen methods for .NET which we'll discuss are: Expression Trees, IL Generation, and Syntax Generation. There are other methods, such as text templating (eg using T4). Here are the pros and cons of each as I see them.

Expression Trees

Using LINQ Expression Trees to compile expressions at runtime.

:::tip Easy to use, expressive, and often the most approachable place to start when you need runtime code generation. :::

:::caution Expression trees are interpreted on AOT-only platforms like iOS, and some language constructs simply are not available. :::
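As a minimal sketch of what compiling an expression at runtime looks like, the snippet below builds the equivalent of x => x * 2 + 1 from Expression nodes and compiles it into a delegate. The class and method names are made up for illustration.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical helper, not part of any library mentioned in this post.
public static class ExpressionTreeSketch
{
    // Builds and compiles the equivalent of: x => x * 2 + 1
    public static Func<int, int> BuildDoublePlusOne()
    {
        ParameterExpression x = Expression.Parameter(typeof(int), "x");

        // (x * 2) + 1, assembled node by node.
        Expression body = Expression.Add(
            Expression.Multiply(x, Expression.Constant(2)),
            Expression.Constant(1));

        // Compile() turns the tree into an invokable delegate at runtime.
        return Expression.Lambda<Func<int, int>>(body, x).Compile();
    }
}
```

Compilation is the expensive part, so you would typically call BuildDoublePlusOne() once and cache the delegate it returns.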

IL Generation

Using Reflection.Emit to dynamically create types and methods using Common Intermediate Language (known as CIL or just IL), which is the assembly language of the CLR.

:::tip IL generation can produce code which cannot be expressed in C#, such as direct access to private members. :::

:::warning The trade-off is ergonomics: IL is verbose, awkward to debug, difficult to use for higher-level features like async/await, and unavailable on AOT-only platforms. :::
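To make the "direct access to private members" point concrete, here is a small sketch that emits a DynamicMethod which reads a private field. Vault and IlSketch are hypothetical names invented for this example; the Reflection.Emit calls are the standard ones.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical type with a private field we want to read from outside.
public sealed class Vault
{
    private int secret;
    public Vault(int secret) { this.secret = secret; }
}

public static class IlSketch
{
    public static Func<Vault, int> BuildSecretReader()
    {
        FieldInfo field = typeof(Vault).GetField(
            "secret", BindingFlags.Instance | BindingFlags.NonPublic);

        var method = new DynamicMethod(
            "ReadSecret",
            typeof(int),             // return type
            new[] { typeof(Vault) }, // parameter types
            typeof(Vault),           // owner: associate the method with Vault
            true);                   // skipVisibility: bypass accessibility checks

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);      // push the Vault argument
        il.Emit(OpCodes.Ldfld, field); // replace it with the value of vault.secret
        il.Emit(OpCodes.Ret);          // return that value

        return (Func<Vault, int>)method.CreateDelegate(typeof(Func<Vault, int>));
    }
}
```

The equivalent C#, vault => vault.secret, would not compile outside of Vault, which is exactly the restriction IL generation sidesteps.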

Syntax Generation

Using Roslyn or some other API to generate C# syntax trees or source code and compile it either at runtime or when the target project is built.

:::tip Syntax generation gives you direct access to the full C# language and works well on AOT-only platforms because the output is plain source code. :::

:::note The API can feel indirect because it was designed for parsing and compilation first, not authoring, and runtime scenarios mean shipping Roslyn with your app. :::

Orleans

Microsoft Orleans uses the latter two approaches: IL and Roslyn. It uses Roslyn wherever possible, since it allows for easy access to C# language features like async and since it's easy to comprehend both the code generator and the generated code. Otherwise, IL generation is used for two things:

Generating code at runtime. For example ILSerializerGenerator generates serializers as a last resort for types which C# serializers couldn't be generated for (for example, private inner classes). It's a faster and less restricted alternative to .NET's BinaryFormatter.
Producing code which cannot be expressed in C#. For example, FieldUtils provides access to private fields and methods for serialization.

General Strategy

Regardless of which technology a library makes use of, code generation typically involves two phases:

Metadata Collection
- The code generator takes some input and creates an abstract representation of it in order to drive the code synthesis process.
- Eg, a library for deeply cloning objects might take a Type as input and generate an object describing each field in that type.
Code Synthesis
- The code generator uses the metadata model to drive the process of actually generating code (LINQ expressions, IL instructions, syntax tree nodes).
- Eg, our deep cloning library will generate a method which takes an object of the specified type from the metadata model and then recursively copy each of the fields.

The two phases can be merged for simple code generators. Orleans uses two phases. In phase 1, the input assembly is scanned and metadata is collected for types matching various criteria: Grain classes, Grain interfaces, serializable types, and custom serializer registrations. In phase 2, support classes are generated. For example, each grain interface has two classes generated: an RPC proxy and an RPC stub.
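The two phases can be sketched in a few lines. The example below is an illustration only, not how Orleans does it: phase 1 collects the writable instance fields of a type, and phase 2 turns that metadata into a compiled copier using expression trees. To stay short it produces a shallow copy, skips readonly fields, and requires a parameterless constructor. Point and TwoPhaseSketch are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical input type.
public sealed class Point
{
    public int X;
    public int Y;
}

public static class TwoPhaseSketch
{
    // Phase 1: metadata collection. Describe the fields we intend to copy.
    public static IReadOnlyList<FieldInfo> CollectFields(Type type)
    {
        var fields = new List<FieldInfo>();
        const BindingFlags flags =
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;
        foreach (FieldInfo field in type.GetFields(flags))
        {
            // Expression trees refuse to assign readonly fields, so skip them here.
            if (!field.IsInitOnly) fields.Add(field);
        }
        return fields;
    }

    // Phase 2: code synthesis. Generate a copier from the metadata model.
    public static Func<T, T> SynthesizeShallowCopier<T>(IReadOnlyList<FieldInfo> fields)
        where T : class, new()
    {
        ParameterExpression source = Expression.Parameter(typeof(T), "source");
        ParameterExpression result = Expression.Variable(typeof(T), "result");

        // result = new T();
        var body = new List<Expression>
        {
            Expression.Assign(result, Expression.New(typeof(T)))
        };

        // result.field = source.field; for every collected field.
        foreach (FieldInfo field in fields)
        {
            body.Add(Expression.Assign(
                Expression.Field(result, field),
                Expression.Field(source, field)));
        }

        // The last expression in a block is its value, ie "return result;".
        body.Add(result);

        return Expression.Lambda<Func<T, T>>(
            Expression.Block(new[] { result }, body), source).Compile();
    }
}
```

Keeping the phases separate means the metadata model can be reused with a different synthesis back end (IL or Roslyn) without touching the collection code.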

Conclusion

That's enough for now. Maybe next time we'll take a look at writing that hypothetical deep cloning library using IL generation. After that, we can take a look at a serialization library I've been working on which uses Roslyn for both metadata collection and syntax generation. If either of those things is interesting to you, let me know here or on Twitter.

:::important If IL generation is the piece you want to see in practice, the next post walks through a deep-copy implementation step by step. :::

Next Post: .NET IL Generation - Writing DeepCopy

.NET IL Generation - Writing DeepCopy

Reuben Bond — Sat, 04 Nov 2017 00:00:00 GMT

This is the second part in a series of short posts covering code generation on the .NET platform.

IL Generation

Last time, we skimmed over some methods to generate code on .NET and one of them was emitting IL. IL generation lets us circumvent the rules C# and other languages put in place to protect us from our own stupidity. Without those rules, we can implement all kinds of fancy foot guns. Rules like “don't access private members of foreign types” and “don't modify readonly fields”. That last one is interesting: C#'s readonly translates into initonly on the IL/metadata level so theoretically we shouldn't be able to modify those fields even using IL. As a matter of fact we can, but it comes at a cost: our IL will no longer be verifiable. That means that certain tools will bark at you if you try to write IL code which commits this sin, tools such as PEVerify and ILVerify. Verifiable code also has ramifications for Security-Transparent Code. Thankfully for us, Code Access Security and Security Transparent Code don't exist in .NET Core and they usually don't cause issues for .NET Framework.

Enough stalling, onto our mission briefing.

DeepCopy

Today we're going to implement the guts of a library for creating deep copies of objects. Essentially it provides one method:

public static T Copy<T>(T original);

Our library will be called DeepCopy and the source is up on GitHub at ReubenBond/DeepCopy. Feel free to mess about with it. The majority of the code was adapted from the Orleans codebase.

Deep copying is important for frameworks such as Orleans, since it allows us to safely send mutable objects between grains on the same node without having to first serialize & then deserialize them, among other things. Of course, immutable objects (such as strings) are shared without copying. Oddly enough, serializing then deserializing an object is the accepted Stack Overflow answer to the question of “how can I deep copy an object?”.

Let's see if we can fix that.

Battle Plan

The Copy method will recursively copy every field in the input object into a new instance of the same type. It must be able to deal with multiple references to the same object, so that if the user provides an object which contains a reference to itself then the result will also contain a reference to itself. That means we'll need to perform reference tracking. That's easy to do: we maintain a Dictionary<object, object> which maps from original object to copy object. Our main Copy<T>(T orig) method will call into a helper method with that dictionary as a parameter:

public static T Copy<T>(T original, CopyContext context)
{
  /* TODO: implementation */
}

The copy routine is roughly as follows:

If the input is null, return null.
If the input has already been copied (or is currently being copied), return its copy.
If the input is 'immutable', return the input.
If the input is an array, copy each element into a new array and return it.
Create a new instance of the input type and recursively copy each field from the input to the output and return it.

Our definition of immutable is simple: the type is either a primitive or it's marked using a special [Immutable] attribute. More elaborate immutability checks could probably be soundly implemented, so submit a PR if you've improved upon it.

Everything but the last step in our routine is simple enough to do without generating code. The last step, recursively copying each field, can be performed using reflection to get and set field values. Reflection is a real performance killer on the hot path, though, and so we're going to go our own route using IL.
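For reference, here is roughly what the slow, reflection-only version of the routine looks like. This is a sketch for comparison, not the DeepCopy library's code: it treats primitives, enums and strings as immutable (there is no [Immutable] attribute here), handles only single-dimensional arrays, and uses RuntimeHelpers.GetUninitializedObject (available on .NET Core 2.0 and later) to create instances. Node and ReflectionDeepCopy are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.CompilerServices;

// Hypothetical type used only to exercise the sketch.
public sealed class Node
{
    public string Name;
    public Node Next;
    public int[] Values;
}

public static class ReflectionDeepCopy
{
    // Compares by reference so that tracking ignores overridden Equals/GetHashCode.
    private sealed class ReferenceComparer : IEqualityComparer<object>
    {
        bool IEqualityComparer<object>.Equals(object x, object y) => ReferenceEquals(x, y);
        int IEqualityComparer<object>.GetHashCode(object obj) => RuntimeHelpers.GetHashCode(obj);
    }

    public static T Copy<T>(T original) =>
        (T)CopyInner(original, new Dictionary<object, object>(new ReferenceComparer()));

    private static object CopyInner(object original, Dictionary<object, object> copies)
    {
        // 1. Null stays null.
        if (original == null) return null;

        Type type = original.GetType();

        // 2. Already copied (or currently being copied): return the tracked copy.
        if (!type.IsValueType && copies.TryGetValue(original, out object existing)) return existing;

        // 3. 'Immutable' values are shared rather than copied.
        if (type.IsPrimitive || type.IsEnum || type == typeof(string)) return original;

        // 4. Arrays: copy element by element (single-dimensional only in this sketch).
        if (type.IsArray)
        {
            var source = (Array)original;
            Array target = Array.CreateInstance(type.GetElementType(), source.Length);
            copies[original] = target;
            for (int i = 0; i < source.Length; i++)
            {
                target.SetValue(CopyInner(source.GetValue(i), copies), i);
            }
            return target;
        }

        // 5. Everything else: create an instance without running a constructor,
        //    record it before recursing so cycles resolve, then copy each field.
        object result = RuntimeHelpers.GetUninitializedObject(type);
        if (!type.IsValueType) copies[original] = result;

        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public
            | BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        for (Type t = type; t != null; t = t.BaseType)
        {
            foreach (FieldInfo field in t.GetFields(flags))
            {
                field.SetValue(result, CopyInner(field.GetValue(original), copies));
            }
        }
        return result;
    }
}
```

Every FieldInfo.GetValue/SetValue call here goes through reflection and boxes value types, which is the per-field overhead the generated IL removes.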

Diving Into The Code

The main IL generation in DeepCopy occurs inside CopierGenerator.cs in the CreateCopier<T>(Type type) method. Let's walk through it:

First we create a new DynamicMethod which will hold the IL code we emit. We have to tell DynamicMethod the signature of the method we're creating. In our case, it matches a generic delegate type, delegate T DeepCopyDelegate<T>(T original, CopyContext context). Then we get the ILGenerator for the method so that we can begin emitting IL code to it.

var dynamicMethod = new DynamicMethod(
    type.Name + "DeepCopier",
    typeof(T), // The return type of the delegate
    new[] {typeof(T), typeof(CopyContext)}, // The parameter types of the delegate.
    typeof(CopierGenerator).Module,
    true);

var il = dynamicMethod.GetILGenerator();

The IL is going to be rather complicated because it needs to deal with immutable types and value types, but let's walk through it bit-by-bit.

// Declare a variable to store the result.
il.DeclareLocal(type);

Next we need to initialize our new local variable to a new instance of the input type. There are 3 cases to consider, each corresponding to a block in the following code:

The type is a value type (struct). Initialize it by essentially using a default(T) expression.
The type has a parameterless constructor. Initialize it by calling new T().
The type does not have a parameterless constructor. In this case we ask the framework for help and we call FormatterServices.GetUninitializedObject(type).

// Construct the result.
var constructorInfo = type.GetConstructor(Type.EmptyTypes);
if (type.IsValueType)
{
    // Value types can be initialized directly.
    // C#: result = default(T);
    il.Emit(OpCodes.Ldloca_S, (byte)0);
    il.Emit(OpCodes.Initobj, type);
}
else if (constructorInfo != null)
{
    // If a default constructor exists, use that.
    // C#: result = new T();
    il.Emit(OpCodes.Newobj, constructorInfo);
    il.Emit(OpCodes.Stloc_0);
}
else
{
    // If no default constructor exists, create an instance using GetUninitializedObject
    // C#: result = (T)FormatterServices.GetUninitializedObject(type);
    var field = this.fieldBuilder.GetOrCreateStaticField(type);
    il.Emit(OpCodes.Ldsfld, field);
    il.Emit(OpCodes.Call, this.methodInfos.GetUninitializedObject);
    il.Emit(OpCodes.Castclass, type);
    il.Emit(OpCodes.Stloc_0);
}

Interlude - What IL Should We Emit?

Even if you're not a first-timer with IL, it's not always easy to work out what IL you need to emit to achieve the desired result. This is where tools come in to help you. Personally I typically write my code in C# first, slap it into LINQPad, hit run and open the IL tab in the output. It's great for experimenting.

Another option is to use a decompiler/disassembler like JetBrains' dotPeek. You would compile your assembly and open it in dotPeek to reveal the IL.

Finally, if you're like me, then ReSharper is indispensable. It's like coding on rails (train tracks, not Ruby). ReSharper comes with a convenient IL Viewer.

Alright, so that's how you work out what IL to generate. You'll occasionally want to visit the docs, too.

Back To Emit

Now we have a new instance of the input type stored in our local result variable. Before we do anything else, we must record the newly created reference. We push each argument onto the stack in order and use the non-virtual Call op-code to invoke context.RecordObject(original, result). We can use the non-virtual Call op-code to call CopyContext.RecordObject because CopyContext is a sealed class. If it wasn't, we would use Callvirt instead.

// An instance of a value type can never appear multiple times in an object graph,
// so only record reference types in the context.
if (!type.IsValueType)
{
    // Record the object.
    // C#: context.RecordObject(original, result);
    il.Emit(OpCodes.Ldarg_1); // context
    il.Emit(OpCodes.Ldarg_0); // original
    il.Emit(OpCodes.Ldloc_0); // result, i.e, the copy of original
    il.Emit(OpCodes.Call, this.methodInfos.RecordObject);
}

On to the meat of our generator! With the accounting out of the way, we can enumerate over each field and generate code to copy each one into our result variable. The comments narrate the process:

// Copy each field.
foreach (var field in this.copyPolicy.GetCopyableFields(type))
{
    // Load a reference to the result.
    if (type.IsValueType)
    {
        // Value types need to be loaded by address rather than copied onto the stack.
        il.Emit(OpCodes.Ldloca_S, (byte)0);
    }
    else
    {
        il.Emit(OpCodes.Ldloc_0);
    }

    // Load the field from the original.
    il.Emit(OpCodes.Ldarg_0);
    il.Emit(OpCodes.Ldfld, field);

    // Deep-copy the field if needed, otherwise just leave it as-is.
    if (!this.copyPolicy.IsShallowCopyable(field.FieldType))
    {
        // Copy the field using the generic Copy<T> method.
        // C#: Copy<T>(field)
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Call, this.methodInfos.CopyInner.MakeGenericMethod(field.FieldType));
    }

    // Store the copy of the field on the result.
    il.Emit(OpCodes.Stfld, field);
}

Return the result and build our delegate using CreateDelegate so that we can start using it immediately.

// C#: return result;
il.Emit(OpCodes.Ldloc_0);
il.Emit(OpCodes.Ret);
return dynamicMethod.CreateDelegate(typeof(DeepCopyDelegate<T>)) as DeepCopyDelegate<T>;

That's the guts of the library. Of course many details were left out, such as:

Caching Type values in static fields so that we can reference them from our generated code. See StaticFieldBuilder.cs.
The special handling of arrays in DeepCopier.cs.
Optimizations such as using CachedReadConcurrentDictionary<TKey, TValue> for a slight improvement over ConcurrentDictionary<TKey, TValue> for workloads with a diminishing write volume.

Performance Tuning for .NET Core

Reuben Bond — Tue, 15 Jan 2019 00:00:00 GMT

Some of you may know I've been spending whatever time I can scrounge together grinding away at a new serialization library for .NET. Serializers can be complicated beasts. They have to be reliable, flexible, and fast beyond reproach. I won't convince you that serialization libraries have to be quick — in this post, that's a given. These are some tips from my experience in optimizing Hagar's performance. Most of this advice is applicable to other types of libraries or applications.

A post on performance should have minimal overhead and get straight to the point, so this post focuses on tips to help you and things to look out for. Message me on Twitter if something is unclear or you have something to add.

:::note This post is intentionally heuristic-heavy: each tip is aimed at hot paths where nanoseconds and allocations add up under load. :::

Maximize profitable inlining

Inlining is the technique where a method body is copied to the call site so that we can avoid the cost of jumping, argument passing, and register saving/restoring. In addition to saving those costs, inlining is a requirement for other optimizations. Roslyn (C#'s compiler) does not inline code. Instead, it is the responsibility of the JIT, as are most optimizations.

:::tip Inlining is rarely about saving the call itself. The bigger win is that once a method is inlined, the JIT can see more of the surrounding code and unlock other optimizations. :::

Use static throw helpers

A recent change which involved a significant refactor added around 20ns to the call duration for the serialization benchmark, increasing times from ~130ns to ~150ns (which is significant).

The culprit was the throw statement added in this helper method:

public static Writer<TBufferWriter> CreateWriter<TBufferWriter>(
    this TBufferWriter buffer,
    SerializerSession session) where TBufferWriter : IBufferWriter<byte>
{
    if (session == null) throw new ArgumentNullException(nameof(session));
    return new Writer<TBufferWriter>(buffer, session);
}

When a method contains a throw statement, the JIT will not inline it. The common trick to solve this is to add a static "throw helper" method to do the dirty work for you, so the end result looks like this:

public static Writer<TBufferWriter> CreateWriter<TBufferWriter>(
    this TBufferWriter buffer,
    SerializerSession session) where TBufferWriter : IBufferWriter<byte>
{
    if (session == null) ThrowSessionNull();
    return new Writer<TBufferWriter>(buffer, session);

    void ThrowSessionNull() => throw new ArgumentNullException(nameof(session));
}

Crisis averted. The codebase uses this trick in many places. Having the throw statement in a separate method may have other benefits such as improving the locality of your commonly used code paths, but I'm unsure and haven't measured the impact.

Minimize virtual/interface calls

Virtual calls are slower than direct calls. If you're writing a performance critical system then there's a good chance you'll see virtual call overhead show up in the profiler. For one, virtual calls require indirection.

Devirtualization is a feature of many JIT Compilers, and RyuJIT is no exception. It's a complicated feature, though, and there are not many cases where RyuJIT can currently prove (to itself) that a method can be devirtualized and therefore become a candidate for inlining. Here are a couple of general tips for taking advantage of devirtualization, but I'm sure there are more (so let me know if you have any).

Mark classes as sealed by default. When a class/method is marked as sealed, RyuJIT can take that into account and is likely able to inline a method call.
Mark override methods as sealed if possible.
Use concrete types instead of interfaces. Concrete types give the JIT more information, so it has a better chance of being able to inline your call.
Instantiate and use non-sealed objects in the same method (rather than having a 'create' method). RyuJIT can devirtualize non-sealed method calls when the type is definitely known, such as immediately after construction.
Use generic type constraints for polymorphic types so that they can be specialized using a concrete type and interface calls can be devirtualized. In Hagar, our core writer type is defined as follows:

public ref struct Writer<TBufferWriter> where TBufferWriter : IBufferWriter<byte>
{
    private TBufferWriter output;
    // --- etc ---

All calls to methods on output in the CIL which Roslyn emits will be preceded by a constrained instruction which tells the JIT that instead of making a virtual/interface call, the call can be made to the precise method defined on TBufferWriter. This helps with devirtualization. All calls to methods defined on output are successfully devirtualized as a result. Here's a CoreCLR thread by Andy Ayers on the JIT team which details current and future work for devirtualization.
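The difference between an interface-typed parameter and a constrained generic parameter is easiest to see with a struct. In this sketch, ISink, CountingSink and ConstrainedCallSketch are hypothetical stand-ins (they are not Hagar types): the interface version boxes the struct and dispatches through the interface, while the generic version is specialized for the struct type.

```csharp
using System;

// Hypothetical stand-in for something like IBufferWriter<byte>.
public interface ISink
{
    void Write(byte value);
    int Count { get; }
}

public struct CountingSink : ISink
{
    private int count;
    public void Write(byte value) => count++;
    public int Count => count;
}

public static class ConstrainedCallSketch
{
    // Interface-typed parameter: passing a struct boxes it, and every Write
    // call is an interface call on that boxed copy.
    public static void WriteAllViaInterface(ISink sink, byte[] data)
    {
        foreach (byte b in data) sink.Write(b);
    }

    // Constrained generic parameter: the JIT compiles a specialized body for
    // each struct type, so Write can be called directly on TSink.
    public static void WriteAll<TSink>(ref TSink sink, byte[] data) where TSink : ISink
    {
        foreach (byte b in data) sink.Write(b);
    }
}
```

A side effect makes the boxing visible: after WriteAllViaInterface(sink, data) the caller's CountingSink still reports Count == 0 because the writes went to a boxed copy, whereas WriteAll(ref sink, data) updates the caller's struct in place.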

Reduce allocations

.NET's garbage collector is a remarkable piece of engineering. GC allows for algorithmic optimizations for some lock-free data structures and also removes whole classes of bugs and lightens the developer's cognitive load. All things considered, garbage collection is a tremendously successful technique for memory management.

However, while the GC is a powerful workhorse, it helps to lighten its load, not only because it means your application will pause for collection less often (and more generally, less CPU time will be devoted to GC work), but also because a smaller working set is beneficial for cache locality.

The rule-of-thumb for allocations is that they should either die in the first generation (Gen0) or live forever in the last (Gen2).

:::important A useful rule of thumb is that allocations should either die young in Gen0 or live long enough to justify promotion. The awkward middle is where GC overhead tends to hurt. :::

.NET uses a bump allocator where each thread allocates objects from its per-thread context by 'bumping' a pointer. For this reason, better cache locality can be achieved for short-lived allocations when they are allocated and used on the same thread.

For more info on .NET's GC, see Matt Warren's blog post series, Learning How Garbage Collectors Work, and pre-order Konrad Kokosa's book, Pro .NET Memory Management. Also check out his fantastic free .NET memory management poster; it's a great reference.

Pool buffers/objects

Hagar itself doesn't manage buffers but instead defers the responsibility to the user. This might sound onerous but it's not, since it's compatible with System.IO.Pipelines. Therefore, we can take advantage of the high performance buffer pooling which the default Pipe provides by means of System.Buffers.ArrayPool<T>.

Generally speaking, reusing buffers lets you put much less pressure on the GC - your users will be thankful. Don't write your own buffer pool, unless you truly need to, though - those times have passed.

:::caution Reach for ArrayPool<T> or System.IO.Pipelines before building your own pool. Custom pooling code is easy to get subtly wrong and hard to benchmark honestly. :::
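A minimal sketch of the rent/use/return pattern with the shared pool follows. PoolingSketch is a made-up name and the workload is arbitrary; the ArrayPool<T> calls are the real API.

```csharp
using System;
using System.Buffers;

public static class PoolingSketch
{
    public static int SumOfSquares(int count)
    {
        // Rent may hand back a larger array than requested, and its contents
        // are not guaranteed to be cleared, so only touch the first 'count' slots.
        int[] rented = ArrayPool<int>.Shared.Rent(count);
        try
        {
            for (int i = 0; i < count; i++) rented[i] = i * i;

            int sum = 0;
            for (int i = 0; i < count; i++) sum += rented[i];
            return sum;
        }
        finally
        {
            // Always return the array, and never use it after returning it.
            ArrayPool<int>.Shared.Return(rented);
        }
    }
}
```

In steady state, repeated calls reuse pooled arrays instead of allocating a fresh int[] each time.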

Avoid boxing

Wherever possible, do not box value types by casting them to a reference type. This is common advice, but it requires some consideration in your API design. In Hagar, interface and method definitions which might accept value types are made generic so that they can be specialized to the precise type and avoid boxing/unboxing costs. As a result, there is no hot-path boxing. Boxing is still present in some cases, such as string formatting for exception methods. Those particular boxing allocations can be removed by explicit .ToString() calls on the arguments.

:::warning Boxing on a hot path is easy to miss because the code still looks clean. Generic APIs often pay for themselves here by letting the JIT specialize away the allocation. :::

Reduce closure allocations

Allocate closures only once and store the result for repeated use. For example, it's common to pass a delegate to ConcurrentDictionary<K, V>.GetOrAdd. Instead of writing the delegate as an inline lambda, define it as a private field on the class. Here's an example from the optional ISerializable support package in Hagar:

private readonly Func<Type, Action<object, SerializationInfo, StreamingContext>> createConstructorDelegate;

public ObjectSerializer(SerializationConstructorFactory constructorFactory)
{
    // Other parameters/statements omitted.
    this.createConstructorDelegate = constructorFactory.GetSerializationConstructorDelegate;
}

// Later, on a hot code path:
var constructor = this.constructors.GetOrAdd(info.ObjectType, this.createConstructorDelegate);

Minimize copying

.NET Core 2.0 and 2.1 and recent C# versions have made considerable strides in allowing library developers to eliminate data copying. The most notable addition is Span<T>, but the in parameter modifier and readonly struct are also worth mentioning.

Use `Span<T>` to avoid array allocations and avoid data copying

Span<T> and friends are a gigantic performance win for .NET, particularly .NET Core where they use an optimized representation to reduce their size, which required adding GC support for interior pointers. Interior pointers are managed references which point to within the bounds of an array, as opposed to only being able to point to the first element and therefore requiring an additional field containing an offset into the array. For more info on Span<T> and friends, read Stephen Toub's article, All About Span: Exploring a New .NET Mainstay.

Hagar makes extensive use of Span<T> because it allows us to cheaply create views over small sections of larger buffers to work with. Enough has been written on the subject that there's no use me writing more here.
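For readers who haven't used it yet, the core idea fits in a few lines. This sketch (SpanSketch is a hypothetical name) creates views over part of an existing array without allocating or copying anything.

```csharp
using System;

public static class SpanSketch
{
    // Reads a window of the buffer through a view; no new array is created.
    public static int SumWindow(byte[] buffer, int offset, int length)
    {
        var window = new ReadOnlySpan<byte>(buffer, offset, length);
        int sum = 0;
        foreach (byte b in window) sum += b;
        return sum;
    }

    // Writes through a view: the underlying array is modified in place.
    public static void FillWindow(byte[] buffer, int offset, int length, byte value)
    {
        buffer.AsSpan(offset, length).Fill(value);
    }
}
```

The pre-Span alternative was either to copy the window into a new array or to thread (array, offset, length) triples through every method signature.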

Pass structs by `ref` to minimize on-stack copies

Hagar uses two main structs, Reader & Writer<TOutputBuffer>. These structs each contain several fields and are passed to almost every call along the serialization/deserialization call path.

Without intervention, each method call made with these structs would carry significant weight since the entire struct would need to be copied onto the stack for every call, not to mention any mutations would need to be copied back to the caller.

We can avoid that cost by passing these structs as ref parameters. C# also supports this ref as the target of an extension method, which is very convenient. As far as I know, there's no way to ensure that a particular struct type is always passed by ref. This can lead to subtle bugs if you accidentally omit ref in a method's parameter list, since the struct will be silently copied and modifications made by the method (eg, advancing a write pointer) will be lost.
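The lost-update bug is easy to reproduce. In the sketch below, Cursor and CursorExtensions are hypothetical (far smaller than Hagar's Reader and Writer), and the three methods differ only in how the struct is passed.

```csharp
// Hypothetical mutable struct standing in for a reader/writer.
public struct Cursor
{
    public int Position;
}

public static class CursorExtensions
{
    // Passed by value: the method advances its own copy and the caller never sees it.
    public static void AdvanceByValue(Cursor cursor, int count) => cursor.Position += count;

    // Passed by ref: no copy is made and the caller's struct is updated.
    public static void Advance(ref Cursor cursor, int count) => cursor.Position += count;

    // Extension method on a struct passed by reference (C# 7.2 and later).
    public static void Skip(this ref Cursor cursor, int count) => cursor.Position += count;
}
```

With a fresh Cursor, AdvanceByValue(cursor, 4) leaves Position at 0, Advance(ref cursor, 4) moves it to 4, and cursor.Skip(2) then moves it to 6. The by-value version compiles without any warning, which is what makes the bug subtle.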

Avoid defensive copies

Roslyn sometimes has to do extra work to guarantee language invariants. When a struct is stored in a readonly field, the compiler will insert instructions to defensively copy that field before involving it in any operation which isn't guaranteed to not mutate it. Typically this means calls to methods defined on the struct type itself, because passing a struct as an argument to a method defined on another type already requires copying the struct onto the stack (unless it's passed by ref or in).

This defensive copy can be avoided if the struct is defined as a readonly struct, which is a C# 7.2 language feature, enabled by adding <LangVersion>7.2</LangVersion> to your csproj file.

Sometimes it's better to omit the readonly modifier on an otherwise immutable struct field if you are unable to define it as a readonly struct.
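Defensive copies are invisible in source, but a mutating method makes them observable. The sketch below uses hypothetical types (Counter, CounterHolder, Meters) to show a call on a readonly field operating on a hidden copy, alongside a readonly struct for which the compiler needs no copy.

```csharp
// A mutable struct: Increment changes its own state.
public struct Counter
{
    public int Value;
    public int Increment() => ++Value;
}

public sealed class CounterHolder
{
    private readonly Counter readOnlyCounter;
    private Counter mutableCounter;

    public CounterHolder()
    {
        readOnlyCounter = new Counter();
        mutableCounter = new Counter();
    }

    public int BumpReadOnly()
    {
        // The compiler copies the field, then calls Increment on the copy.
        readOnlyCounter.Increment();
        return readOnlyCounter.Value;
    }

    public int BumpMutable()
    {
        // No copy: Increment runs directly against the field.
        mutableCounter.Increment();
        return mutableCounter.Value;
    }
}

// A readonly struct (C# 7.2 and later) promises that no member mutates state,
// so calls on a readonly field of this type don't need a defensive copy.
public readonly struct Meters
{
    public readonly double Value;
    public Meters(double value) { Value = value; }
    public Meters Add(Meters other) => new Meters(Value + other.Value);
}
```

On a new CounterHolder, BumpReadOnly() returns 0 because the increment was applied to the copy, while BumpMutable() returns 1. The copy in the first case is pure overhead whenever the method doesn't actually mutate anything.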

See Jon Skeet's NodaTime library as an example. In this PR, Jon made most structs readonly and was therefore able to add the readonly modifier to fields holding those structs without negatively impacting performance.

Reduce branching & branch misprediction

Modern CPUs rely on having long pipelines of instructions which are processed with some concurrency. This involves the CPU analyzing instructions to determine which ones aren't reliant on previous instructions and also involves guessing which conditional jump statements are going to be taken. In order to do this, the CPU uses a component called the branch predictor which is responsible for guessing which branch will be taken. It typically does this by reading & writing entries in a table, revising its prediction based upon what happened last time the conditional jump was executed.

When it guesses correctly, this prediction process provides a substantial speedup. When it mispredicts the branch (jump target), however, it needs to throw out all of the work performed in processing instructions after the branch and re-fill the pipeline with instructions from the correct branch before continuing execution.

The fastest branch is no branch. First try to minimize the number of branches, always measuring whether or not your alternative is faster. When you cannot eliminate a branch, try to minimize misprediction rates. This may involve using sorted data or restructuring your code.

One strategy for eliminating a branch is to replace it with a lookup. Sometimes an algorithm can be made branch-free instead of using conditionals. Sometimes hardware intrinsics can be used to eliminate branching.
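As a small illustration of the lookup strategy, here are two ways to convert a 4-bit value to a hex digit. HexSketch is a hypothetical name and this is not code from Hagar. The lookup version has no data-dependent conditional in source, although the indexer may still carry a bounds check (a very predictable branch), so measure before assuming it's a win.

```csharp
public static class HexSketch
{
    // Branching version: which arm runs depends on the data.
    public static char ToHexBranching(int nibble) =>
        nibble < 10 ? (char)('0' + nibble) : (char)('a' + (nibble - 10));

    private const string HexDigits = "0123456789abcdef";

    // Lookup version: the low 4 bits index directly into a table.
    public static char ToHexLookup(int nibble) => HexDigits[nibble & 0xF];
}
```

Both methods agree for inputs 0 through 15. For random input the conditional in the first version is hard to predict, whereas the second only depends on a table that stays hot in cache.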

Other miscellaneous tips

Avoid LINQ. LINQ is great in application code, but rarely belongs on a hot path in library/framework code. LINQ is difficult for the JIT to optimize (IEnumerable<T>...) and tends to be allocation-happy.
Use concrete types instead of interfaces or abstract types. This was mentioned above in the context of inlining, but this has other benefits. Perhaps the most common being that if you are iterating over a List<T>, it's best to not cast that list to IEnumerable<T> first (eg, by using LINQ or passing it to a method as an IEnumerable<T> parameter). The reason for this is that enumerating over a list using foreach uses a non-allocating List<T>.Enumerator struct, but when it's cast to IEnumerable<T>, that struct must be boxed to IEnumerator<T> for foreach.
Reflection is exceptionally useful in library code, but it will kill you if you give it the chance. Cache the results of reflection, consider generating delegates for accessors using IL or Roslyn, or better yet, use an existing library such as Microsoft.Extensions.ObjectMethodExecutor.Sources, Microsoft.Extensions.PropertyHelper.Sources, or FastMember.
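To make the "cache the results of reflection" tip concrete, here is a minimal sketch which compiles a property getter into a delegate once and reuses it afterwards. CachedAccessors and GetGetter are hypothetical names invented for this example, and the sketch assumes public instance properties only. The libraries mentioned above handle many more cases.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq.Expressions;
using System.Reflection;

public static class CachedAccessors
{
    // One compiled getter per (type, property name) pair.
    private static readonly ConcurrentDictionary<(Type, string), Func<object, object>> Getters =
        new ConcurrentDictionary<(Type, string), Func<object, object>>();

    public static Func<object, object> GetGetter(Type type, string propertyName)
    {
        return Getters.GetOrAdd((type, propertyName), key => Compile(key.Item1, key.Item2));
    }

    private static Func<object, object> Compile(Type type, string propertyName)
    {
        // The reflection lookup happens once, on the first request for this property.
        PropertyInfo property = type.GetProperty(propertyName)
            ?? throw new ArgumentException("No public property '" + propertyName + "' on " + type);

        // Builds: instance => (object)((T)instance).Property
        ParameterExpression instance = Expression.Parameter(typeof(object), "instance");
        Expression body = Expression.Convert(
            Expression.Property(Expression.Convert(instance, type), property),
            typeof(object));
        return Expression.Lambda<Func<object, object>>(body, instance).Compile();
    }
}
```

A caller would hold on to the returned delegate, for example `var getLength = CachedAccessors.GetGetter(typeof(string), "Length");`, and invoke it on the hot path instead of calling PropertyInfo.GetValue each time.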

Library-specific optimizations

Optimize generated code

Hagar uses Roslyn to generate C# code for the POCOs you want to serialize, and this C# code is included in your project at compile time. There are some optimizations which we can perform on the generated code to make things faster.

Avoid virtual calls by skipping codec lookup for well-known types

When complex objects contain fields of well-known types such as int, Guid, or string, the code generator will directly insert calls to the hand-coded codecs for those types instead of calling into the CodecProvider to retrieve an IFieldCodec<T> instance for that type. This lets the JIT inline those calls and avoids virtual/interface indirection.

(Unimplemented) Specialize generic types at runtime

Similar to above, the code generator could generate code which uses specialization at runtime.

Pre-compute constant values to eliminate some branching

During serialization, each field is prefixed with a header – usually a single byte – which tells the deserializer which field was encoded. This field header contains 3 pieces of info: the wire type of the field (fixed-width, length-prefixed, tag-delimited, referenced, etc), the schema type of the field (expected, well-known, previously-defined, encoded), which is used for polymorphism, and the field id, which is encoded in the last 3 bits (if it's less than 7). In many cases, it's possible to know exactly what this header byte will be at compile time. If a field has a value type, then we know that the runtime type can never differ from the field type and we always know the field id.

Therefore, we can often save all of the work required to compute the header value and can directly embed it into code as a constant. This saves branching and generally eliminates a lot of IL code.
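Here is a rough sketch of the idea. The bit layout (3 bits of wire type, 2 bits of schema type, 3 bits of field id), the enum values, and every name below are assumptions made for illustration rather than Hagar's actual definitions. The point is only that the general path has to compute the byte at runtime, while generated code for a known field can embed a constant which the C# compiler folds.

```csharp
// Assumed numeric values, for illustration only.
public enum WireType : byte { FixedWidth = 0, LengthPrefixed = 1, TagDelimited = 2, Reference = 3 }
public enum SchemaType : byte { Expected = 0, WellKnown = 1, PreviouslyDefined = 2, Encoded = 3 }

public static class FieldHeader
{
    // Assumed marker meaning "the field id does not fit in the header byte".
    public const uint ExtendedFieldId = 7;

    // General path: computed per field at runtime, including a branch on the field id.
    public static byte Compute(WireType wireType, SchemaType schemaType, uint fieldId)
    {
        uint idBits = fieldId < ExtendedFieldId ? fieldId : ExtendedFieldId;
        return (byte)(((uint)wireType << 5) | ((uint)schemaType << 3) | idBits);
    }
}

public static class GeneratedCodeExample
{
    // What a generator could emit for a Guid field with id 2: the value type rules out
    // polymorphism, so the whole header is known and folded at compile time.
    public const byte Field2Header =
        ((byte)WireType.FixedWidth << 5) | ((byte)SchemaType.Expected << 3) | 2;
}
```

With the constant in place, the generated serializer writes Field2Header directly and never calls Compute for that field.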

Choose appropriate data structures

One of the big performance disadvantages Hagar has when compared to other serializers such as protobuf-net (in its default configuration?) and MessagePack-CSharp is that it supports cyclic graphs and therefore must track objects as they're serialized so that object cycles are not lost during deserialization. When this was first implemented, the core data structure was a Dictionary<object, int>. It was clear in initial benchmarking that reference tracking was a dominating cost. In particular, clearing the dictionary between messages was expensive. By switching to an array of structs instead, the cost of indexing and maintaining the collection is largely eliminated and reference tracking no longer appears in the benchmarks. There is a downside to this: for large object graphs it's likely that this new approach is slower. If that becomes an issue, we can decide to dynamically switch between implementations.
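To give a feel for the array-of-structs approach, here is a minimal sketch of a reference tracker. It is not Hagar's actual implementation: the type name, the method names, and the initial capacity are all invented. The linear scan is exactly why this shape should win for small graphs and lose for large ones.

```csharp
using System;

public sealed class ReferenceTracker
{
    private struct Entry
    {
        public object Value;
        public int Id;
    }

    private Entry[] _entries = new Entry[16];
    private int _count;

    // Returns true if the object was seen before in this message, along with its id.
    // Otherwise records the object under a new id and returns false.
    public bool TryGetOrAdd(object value, out int id)
    {
        // Linear scan: cheap for small graphs, O(n) per lookup for large ones.
        for (int i = 0; i < _count; i++)
        {
            if (ReferenceEquals(_entries[i].Value, value))
            {
                id = _entries[i].Id;
                return true;
            }
        }

        if (_count == _entries.Length)
        {
            Array.Resize(ref _entries, _count * 2);
        }

        id = _count + 1;
        _entries[_count++] = new Entry { Value = value, Id = id };
        return false;
    }

    // Resetting between messages only touches the slots which were used.
    public void Reset()
    {
        Array.Clear(_entries, 0, _count); // drop references so tracked objects can be collected
        _count = 0;
    }
}
```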

Choose appropriate algorithms

Hagar spends a lot of time encoding/decoding variable-length integers, often referred to as varints, in order to reduce the size of the payload (a smaller payload is cheaper to store and transport). Many binary serializers use this technique, including Protocol Buffers. Even .NET's BinaryWriter uses this encoding. Here's a snippet from the reference source:

protected void Write7BitEncodedInt(int value) {
    // Write out an int 7 bits at a time.  The high bit of the byte,
    // when on, tells reader to continue reading more bytes.
    uint v = (uint) value;   // support negative numbers
    while (v >= 0x80) {
        Write((byte) (v | 0x80));
        v >>= 7;
    }
    Write((byte)v);
}

Looking at this source, I want to point out that ZigZag encoding may be more efficient for signed integers which contain negative values, rather than casting to uint.
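For reference, here is a small sketch of ZigZag encoding for 32-bit integers, using the same mapping Protocol Buffers uses for its sint32 type. It interleaves negative and positive numbers (0, -1, 1, -2, ...) so that values of small magnitude produce small unsigned values. The helper names are my own, and VarIntLength simply mirrors the loop in the snippet above.

```csharp
public static class ZigZag
{
    // Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    public static uint Encode(int value) => (uint)((value << 1) ^ (value >> 31));

    // Inverse mapping: the low bit carries the sign.
    public static int Decode(uint encoded) => (int)(encoded >> 1) ^ -(int)(encoded & 1);

    // Number of bytes the 7-bit encoding above would write for this value.
    public static int VarIntLength(uint value)
    {
        int length = 1;
        while (value >= 0x80)
        {
            value >>= 7;
            length++;
        }

        return length;
    }
}
```

With a plain cast to uint, -1 becomes 0xFFFFFFFF and takes 5 bytes. With ZigZag, it becomes 1 and takes a single byte. The trade-off is that non-negative values lose one bit of range per encoded length.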

VarInts in these serializers use an algorithm called Little Endian Base-128 or LEB128, which encodes up to 7 bits per byte. It uses the most significant bit of each byte to indicate whether or not another byte follows (1 = yes, 0 = no). This is a simple format but it may not be the fastest. It might turn out that PrefixVarint is faster. With PrefixVarint, all of those 1s from LEB128 are written in one shot, at the beginning of the payload. This may let us use hardware intrinsics to improve the speed of this encoding & decoding. By moving the size information to the front, we may also be able to read more bytes at a time from the payload, reducing internal bookkeeping and improving performance. If someone wants to implement this in C#, I will happily take a PR if it turns out to be faster.

Hopefully you've found something useful in this post. Let me know if something is unclear or you have something to add. Since I started writing this, I've moved to Redmond and officially joined Microsoft on the Orleans team, working on some very exciting things.

CASPaxos

Reuben Bond — Tue, 21 Jan 2020 00:00:00 GMT

Recently I've been playing around with a new algorithm known as CASPaxos. In this post I'm going to talk about the algorithm and its potential benefits for distributed databases, particularly key-value stores.

Distributed databases must be reliable and scalable. To achieve reliability, DBs replicate data to other servers. To achieve scalability in terms of total storage capacity, DBs must allow the data to be replicated to only a subset of servers - enough to make the data reasonably reliable but not so much that adding a new server does not increase the total storage capacity of the system or make the system unbearably slow. A typical replication factor is 3: each piece of data is stored on 3 servers. Replication is typically implemented using a consensus algorithm. Well-known algorithms in this family that are used for replication are Raft, Multi-Paxos, and ZAB (which is used in ZooKeeper). Those 3 algorithms make servers agree on the ordering of operations in a log. By executing those operations in order, the database engines on each server can create identical replicas of a database. Logs feature very prominently in distributed/reliable systems (Read The Log: What every software engineer should know about real-time data's unifying abstraction by Jay Kreps).

CASPaxos is a new algorithm in this space and it is significantly simpler than the aforementioned algorithms because it does not use log replication. It is a slight modification of the original Paxos algorithm, which is very simple and typically used as a minimal building block for more complicated algorithms such as Multi-Paxos. Instead of replicating log entries between servers, CASPaxos replicates entire values. Because of this, it is best suited for relatively small values, such as individual entries in a key-value store.

So why is this interesting? In short: it offers us simplicity & performance. Before getting into its benefits, here's a sloppy, inaccurate description of CASPaxos - I recommend you read the paper.

:::tip Why it stands out: CASPaxos replaces replicated logs with replicated values. That keeps the core protocol small and makes it a useful mental model before tackling full replicated-log systems. :::

CASPaxos

CASPaxos replicates changes to a single register amongst a set of replicas. The register holds a user-defined value which is modified by successive application of some change function (which is a closure). Each of these modifications is protected by version stamps (ballot numbers) which help to ensure that previously committed register values are not clobbered without being first observed by the writer. The protocol facilitates learning previously committed values so that replicas can keep up with one another.

If you are familiar with Raft, you will know that at its core it replicates a log of values. Conceptually, a log-based replicated state machine folds a fixed function over multiple data (the log entries). By contrast, CASPaxos does not use a fixed function and instead folds varying closures over state, with the resulting state itself being replicated to other replicas.

To illustrate, the following expansions show the result of applying [e0, e1, e2] (log entries) in Raft, versus [f0, f1, f2] (closures) in CASPaxos:

Raft: state = f(e2, f(e1, f(e0, ∅)))
CASPaxos: state = f2(f1(f0(∅)))
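The two expansions can be sketched as a pair of folds. This is purely illustrative: Folds, ApplyLog, and ApplyClosures are names I made up, and neither method involves any replication. The first folds one fixed function over varying entries, the second folds varying closures over the state.

```csharp
using System;
using System.Collections.Generic;

public static class Folds
{
    // Raft-style: a fixed transition function f is applied to each replicated log entry.
    public static TState ApplyLog<TState, TEntry>(
        TState initial, IEnumerable<TEntry> log, Func<TEntry, TState, TState> f)
    {
        TState state = initial;
        foreach (TEntry entry in log)
        {
            state = f(entry, state);
        }

        return state;
    }

    // CASPaxos-style: each operation is its own closure, and the state it produces
    // is what gets replicated.
    public static TState ApplyClosures<TState>(
        TState initial, IEnumerable<Func<TState, TState>> changes)
    {
        TState state = initial;
        foreach (Func<TState, TState> change in changes)
        {
            state = change(state);
        }

        return state;
    }
}
```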

Beyond what gets replicated and how the current state of the system is computed, Raft and CASPaxos differ in other significant ways. For example, CASPaxos is leaderless, whereas Raft uses a strong leader. CASPaxos does not specify the use of heartbeats (in the core algorithm), whereas Raft does. Many of these differences are present because Raft is a more batteries-included algorithm which covers many of the practical concerns involved in building a replicated database.

Neither approach is strictly better than the other, but since the CASPaxos approach (replicating state values rather than log entries) was fairly novel to me in the context of distributed consensus, I'd like to explore some of the implications, especially as they might apply to the systems I work on.

Read the paper to understand the algorithm in more detail.

Simplicity

The canonical implementation of CASPaxos by its author Denis Rystsov (@rystsov) is Gryadka, a key-value store written in JavaScript which sits atop Redis. The core, including the CASPaxos implementation, has fewer than 500 lines of code. Raft was also designed to be a simple and understandable algorithm, but it carries with it the weight of log replication, which brings with it the need for log compaction, which brings with it the need for snapshotting and snapshot transfer. Raft also requires leadership elections because it is built around the concept of a "strong leader". All writes must be served by the single leader in a Raft system, whereas writes can be served by any replica in a CASPaxos system. CASPaxos is simpler to implement than Raft. The extended Raft paper is a great read. Diego Ongaro's Ph.D. dissertation includes an important simplification to the original paper's membership change algorithm. Let's be clear here: Raft definitely achieved its goal of understandability and it truly deserves the widespread adoption it's seen.

:::important What the simplicity buys you: if your workload looks like a replicated key-value store, fewer moving parts means less machinery for leader routing, log compaction, and snapshot transfer. :::

Storage Performance

To analyse the performance implications of CASPaxos, we need to take a little detour and discuss real-world systems. One great example is CockroachDB, a distributed SQL database. CockroachDB aims to be reliable and scalable. To achieve this, they partition their data and replicate each piece of data to a subset of the servers in the system using an algorithm they call MultiRaft. If they were to use a single Raft consensus group, then adding additional servers would not increase the total capacity of the database. If they use many Raft consensus groups naively, the overhead of each consensus group would have a toll on throughput. For example, Raft requires heartbeat messages while idle to maintain leadership. MultiRaft requires multiplexing each consensus group's log records on disk for performance. That means that log entries for each group might not live near each other on disk, since they are interspersed with many other groups' records. This may take a toll on recovery performance. The alternative is to store each group's log in contiguous disk segments, but this reduces write throughput: spinning disks and SSDs both perform better when operating sequentially. The optimizations required to make Raft scale well are tricky largely because of its log-based nature.

Speaking of storage, let's talk briefly about storage engines. The storage engine is the database component responsible for reading and writing data in a reliable way. Examples include RocksDB, LMDB, ESENT (used in Exchange & Active Directory), WiredTiger, TokuDB, and InnoDB. Two of the most common data structures for implementing a storage engine are B+ Trees and more recently, Log-Structured Merge-Trees (LSM trees). In order to make B+ Trees reliable (any machine may crash at any time), a Write-Ahead Log (WAL) is used. This log is a file containing a sequential list of the database transactions which are being performed. The storage engine eventually applies these transactions to the database image. During crash recovery, the storage engine reads this file and ensures that all of the committed transactions have been applied. This recovery algorithm is called ARIES and it can be found in many reliable storage engines. So B+ Trees split your data into two parts: a log file and a tree. Log-Structured Merge-Trees also generally adopt a Write-Ahead Log for recovery. Since spinning disks and SSDs perform best with sequential reads & writes, log files are a good fit for high-performance, reliable systems.

Raft is built around log replication, so it might make sense to integrate with the storage engine so that a single log can be used for both purposes: local durability as well as replication. Unfortunately, the storage engine's log is generally not visible to the storage engine consumer and is usually considered an implementation detail. This means that Raft implementations which use an off-the-shelf storage engine such as RocksDB must store log records inside the storage engine so that they can be read back later. The result is that each operation needs at least 2 writes (1 on the critical path): one for the log entry and one for the result of applying the log entry once it's committed (e.g., updating a value in a key-value store). A B+ Tree engine needs 4 writes (1 critical). By contrast, CASPaxos needs just 1 write: updating the value itself. Log-based algorithms have natural write amplification, whereas CASPaxos does not.

By removing the need for logs, CASPaxos can achieve high write throughput with off-the-shelf storage engines.

:::tip Write amplification matters: when the storage engine already maintains its own WAL, layering a replicated log on top often means writing both the log entry and the materialized result. CASPaxos avoids that extra replicated-log layer. :::

Coordination

Each key in a key-value store based on CASPaxos is completely independent of all other keys. This means that no cross-key coordination is required when serving operations on individual keys. Compare this with Raft or MultiRaft where all operations within a given consensus group are strictly ordered. This ordering requires coordination which has some overhead. It means that a slow operation on one key can more easily impact operations on other keys. The low level of coordination required by CASPaxos supports high-concurrency systems without added complexity.

Coordination is sometimes required, for example when implementing multi-object transactions. Multi-object transactions can be implemented as a higher layer on top of a key-value store with linearizable keys using two-phase commit (2PC). For example, this is how we implement ACID transactions in Orleans, supporting any strongly consistent key-value store.

Challenges

So far we've talked about ways in which CASPaxos might be more suitable for building a distributed key-value store than Raft or MultiRaft. CASPaxos is a simple algorithm and there are many system design questions which are not addressed by the paper definition. So here are some potential challenges when building a real-world system on CASPaxos, as well as some thoughts on how to solve them.

:::warning The paper defines the core protocol, not a full production database. The remaining sections are the practical questions you still need to answer when turning CASPaxos into a complete system. :::

Server Catch-up

When adding a new server to the database system, the server needs to be brought up to speed with the existing servers. This requires adding it to the consensus group as well as copying all data for the keys which it will be replicating. The CASPaxos paper describes this process as a part of membership change. However, a similar process is needed to ensure that data is sufficiently reliable. For example, if a server loses network connectivity for a few seconds then it may miss some updates to some rarely updated keys. The CASPaxos algorithm does not discuss how to ensure that all updates are eventually replicated. In Raft, it is the leader's responsibility to keep followers up to speed. In a system built around CASPaxos, which is leaderless, we will likely need to implement a different solution.

Membership Change

The membership change algorithm in the paper does not offer safety in all cases and it implies a single administrator in the system. Therefore, it is not suitable for use with automated cluster management systems. The proof-of-concept CASPaxos implementation on Orleans uses a different membership change algorithm. It ought to be suitable for automated systems (such as the cluster membership algorithm used in Orleans). I believe the algorithm will be safe once fully implemented, but that has not been demonstrated yet. The key idea is to leverage the consensus mechanism of the protocol for cluster membership change, similar to how Raft and Multi-Paxos commit configuration changes to the log. It uses a special purpose register to store cluster configuration. Proposers indicate which version of the configuration they are using in all calls to Acceptors and Acceptors reject requests from Proposers running old configurations. This is similar to Raft's notion of neutralizing old leaders. Additionally, membership changes are restricted to at most one server at a time, which is a special case of joint consensus. This is the same restriction that Diego Ongaro specified in his Ph.D. dissertation for Raft. In a sense, this extension turns CASPaxos into a 2-level store with the cluster configuration register at the top and data registers below, so the ballot vector is [configuration ballot, data ballot].
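As a rough sketch of that fencing idea (not the Orleans proof-of-concept itself), an acceptor could compare ballot vectors lexicographically and treat an older configuration component as a stale-configuration rejection. All names below are hypothetical, and the sketch leaves out how an acceptor learns about a new configuration in the first place.

```csharp
public enum AcceptResult
{
    Accepted,
    StaleConfiguration,
    StaleBallot
}

public sealed class VersionedAcceptor
{
    // Ballot vector [configuration ballot, data ballot], compared lexicographically.
    private (long Config, long Data) _promised;

    public AcceptResult TryAccept((long Config, long Data) ballot)
    {
        // A proposer running an old configuration is rejected outright,
        // which tells it to refresh its view of the cluster before retrying.
        if (ballot.Config < _promised.Config)
        {
            return AcceptResult.StaleConfiguration;
        }

        // Same configuration but an older data ballot: an ordinary Paxos rejection.
        if (ballot.CompareTo(_promised) < 0)
        {
            return AcceptResult.StaleBallot;
        }

        _promised = ballot;
        return AcceptResult.Accepted;
    }
}
```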

Scale-out

Adding additional servers should increase the total storage capacity of the system. CASPaxos specifies only the minimal building block of a key-value store, so this scale-out is not discussed in the paper. The Raft paper also does not specify this, motivating the development of MultiRaft for CockroachDB. The dynamic range-based partitioning scheme used by CockroachDB is a good candidate. Implementing this might involve storing range configurations in registers and extending the membership change modification to include 3 levels, [cluster ballot, range ballot, data ballot].

Large Values

CASPaxos is not suitable for replicating large values because each value is sent over the wire every time it is updated. For a replication factor of 3, the entire value is sent 3 times for every update and 6 times if the proposer cannot take advantage of the distinguished leader optimization.

This limitation could be alleviated in several ways, or it can be ignored and argued away, leaving users to tackle the problem themselves if they truly need large values.

:::caution CASPaxos shines when values are modest in size. If updates routinely move large blobs, the simplicity win can be eroded by network and storage bandwidth costs. :::

One way to alleviate it might be to split keys over several registers. Without going into detail, this might involve extending the membership change modification yet again to include 4 levels, at which point it may make sense to generalize it into a ballot vector, [...parent ballots, register ballot]. Specifically, [config ballot, range ballot, file ballot, register ballot]. At this point, the system is structured more like a tree than a flat key-value store.

Conclusion

I hope you've enjoyed the post. If you'd like to discuss any aspects of it, for example some glaring inaccuracies, drop me a line via Twitter (@ReubenBond).

Distributed Systems is a young field with many exciting areas for research and development.

Fast CASPaxos

Reuben Bond — Thu, 09 Apr 2026 00:00:00 GMT

:::note TL;DR: Fast CASPaxos adds leaderless commits to CASPaxos so that any proposer can commit an update to a shared register with a single round-trip to a quorum. I've written about CASPaxos before. :::

Introduction

Classic Paxos (aka "single-decree Paxos") lets a group of servers agree on a single, immutable value. It divides its responsibilities into two main roles: proposers, which initiate rounds and choose candidate values, and acceptors, which persist promises and accepted values. In most implementations, these roles are bundled together into every server, but it's common to talk about them as though they are separate processes and we won't deviate from the norm in this post.

CASPaxos extends Paxos into a rewritable register, so proposers can mutate the value over time while maintaining strong consistency guarantees (linearizability) so that no acknowledged write is ever clobbered. Classic Paxos requires two messaging round-trips to decide on a value. That is, it operates in 2 phases:

a prepare phase where a proposer tries to become leader and learns the current state from acceptors, and
an accept phase where that proposer asks acceptors to accept a value.

A common Paxos optimization is to have each accept message piggyback the next prepare message so the same proposer can update the register via a single accept+prepare message instead of needing two separate round-trips.

My contribution is Fast CASPaxos, which lets any proposer update the register in a single accept+prepare request, not only the proposer that already holds leadership. It works by leveraging an insight from Fast Paxos. After coming up with Fast CASPaxos and going through all of the work to propagandize myself about it through deterministic simulation testing, TLA+ models, showers, etc, I discovered that key ideas were already discovered by Heidi Howard and presented in her thesis titled Distributed consensus revised. Still, Fast CASPaxos brings together those ideas and the core idea from CASPaxos in a way that I believe is novel and it identifies one point along the Pareto frontier of optimal distributed consensus algorithms. An interesting side-point about Fast CASPaxos is that it's cheap to switch between leadered and leaderless consensus at runtime, so theoretically you could create an adaptive algorithm that lets you decide based on observed conditions (conflict rates, relative latency, available servers, etc).

The rest of the post covers leadered-vs-leaderless consensus, CASPaxos, and Fast CASPaxos.

Leader-based vs leaderless consensus

Classic Paxos, Multi-Paxos, Raft, and CASPaxos are all examples of leader-based consensus. In leader-based consensus, before a proposer can commit a value, it must first obtain mutually exclusive (but revocable) rights to do so. Leader election grants this right. In Paxos, the prepare phase is where a proposer tries to acquire that right and the accept phase is where it tries to commit a value. If leadership is uncontended then each phase typically takes 1 round-trip (one message back and forth between the proposer and acceptors). Multi-Paxos, Raft, and CASPaxos let the same proposer amortize the cost of becoming a leader by allowing it to continue committing values as long as it is uncontended. The key point is that leadership is about mutual exclusion: only one proposer at a time has the right to commit values. For CASPaxos, the gist of how this works is that every time you commit a value you piggyback leader election for the next value in the same message: each accept request has the next prepare piggy-backed on.

Leadered and leaderless consensus both have a one-round-trip (1 RTT) fast case, but they optimize for different failure and contention patterns. Choosing which is the right approach depends on the scenario - we'll talk about some cases later.

Leadered consensus lets only the leader commit in 1 RTT. If another proposer wants to commit, it must first become the leader (1 RTT) and then commit. That's 2 RTT in total if you were to have a different proposer for each commit. If proposers are trying to commit values concurrently, they can end up contending, possibly indefinitely, trying to squeeze those 2 RTT in before the other proposer deposes them as leader. This is the dueling proposers problem.

Leaderless consensus algorithms allow any proposer to commit a value without first obtaining exclusive rights. This is the crux of the Fast Paxos optimization: acceptors are prepared ahead of time for a shared fast round in which proposers can send accept requests directly to acceptors instead of first performing a prepare phase. If enough acceptors receive the same value, it commits in a single round-trip. If concurrent proposers propose conflicting values, the protocol falls back to classic recovery (prepare then accept). So Fast Paxos is leaderless in fast rounds but leadered during recovery.

The cost of that leaderless fast path is a larger quorum. A later classic prepare quorum must still be able to tell which value, if any, could have been committed in the fast round. In a classic ballot there is only one proposer, so the ballot identifies a single candidate value. In a fast ballot many proposers can race, so different acceptors may report different values for the same ballot. Recovery therefore has to group responses by value at the highest fast ballot it sees and choose the unique maximum, or any tied maximum. That is why fast rounds need a supermajority: not because the fast proposer needs extra votes for its own sake, but because later recovery must be unable to reinterpret the round as having decided a different value. Leaderless consensus is attractive in many scenarios, but those larger quorums and the risk of conflicts mean it generally performs worse than leadered consensus under contention, which is why the ability to switch between the two modes is appealing.
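To make the quorum sizes and the tally rule concrete, here is a minimal sketch. It assumes the common Fast Paxos sizing of a simple majority for classic quorums and ⌈3n/4⌉ for fast quorums, which is one valid choice rather than something this post prescribes. Values are modelled as non-null strings, ballots as (round, proposerId) tuples, and all names are invented for illustration.

```csharp
using System.Collections.Generic;

public static class FastRecovery
{
    // Simple majority.
    public static int ClassicQuorum(int n) => n / 2 + 1;

    // ceil(3n / 4): large enough that any two fast quorums and any classic quorum overlap.
    public static int FastQuorum(int n) => (3 * n + 3) / 4;

    // Given the (ballot, value) pairs reported by a prepare quorum, keep only the
    // responses at the highest accepted ballot and return the most frequently reported
    // value there. Ties are broken arbitrarily. Returns null if nothing was accepted.
    public static string ChooseValue(
        IEnumerable<((long Round, int ProposerId) Ballot, string Value)> accepted)
    {
        (long Round, int ProposerId)? highest = null;
        var tally = new Dictionary<string, int>();
        foreach (var (ballot, value) in accepted)
        {
            if (highest is null || ballot.CompareTo(highest.Value) > 0)
            {
                highest = ballot;
                tally.Clear(); // votes at lower ballots no longer matter
            }

            if (ballot.CompareTo(highest.Value) == 0)
            {
                tally[value] = tally.TryGetValue(value, out int count) ? count + 1 : 1;
            }
        }

        string best = null;
        int bestCount = 0;
        foreach (KeyValuePair<string, int> pair in tally)
        {
            if (pair.Value > bestCount)
            {
                best = pair.Key;
                bestCount = pair.Value;
            }
        }

        return best;
    }
}
```

With 5 acceptors this gives a classic quorum of 3 and a fast quorum of 4. When the highest ballot is a classic one, every response at that ballot carries the same value, so the tally degenerates to the usual "take the value at the highest ballot" rule.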

When each mode fits best

Leadered

High write rate. If one proposer is likely to drive a long sequence of updates, leadered mode amortizes prepare costs across many operations instead of paying them repeatedly.
High conflict rate. When independent proposers frequently want different next values, a leader reduces repeated proposer-versus-proposer collisions and gives more stable progress than optimistic fast rounds. To take advantage of this, you need to route requests to a stable leader.

Leaderless

Geo-distributed, low-conflict deployments. When network latency is high, leaderless 1 RTT commits are particularly attractive, but the penalty of contention is more pronounced.
Infrequent writes. When writes are infrequent and proposers have a chance to learn the latest committed value without going through consensus, fast rounds allow 1 RTT commits at any proposer.
Almost-everywhere agreement. When concurrent proposers are proposing the same value anyway, shared fast rounds let them proceed without dueling. The Rapid cluster membership algorithm uses Fast Paxos for exactly this property: Rapid's cut detector produces proposals based on observer alerts and delays action until churn stabilizes into the same multi-process cut, resulting in very low conflict rates among proposers, allowing them to commit in 1 RTT most of the time without going through a leader.
Register initialization. The initial fast ballot is implicitly prepared, so the first write can skip prepare entirely and go straight to accept. That makes 1 RTT initialization especially attractive when a system creates many lightweight registers which may only be written once.

When it's not clear-cut

I've seen arguments saying that the smaller quorum requirements of leadered protocols confer advantages like:

Reduced chance that a straggler server will slow down consensus
Improved failure masking (more servers can fail before progress stops)

Those are not unreasonable ideas, but in a leadered protocol, the leader itself can be the straggler and slow down all requests (a gray failure). If the leader crashes you need to first detect that, which is inherently slow, and then elect a new leader. That handoff naturally creates a stutter where progress pauses while the system re-establishes mutual exclusion. The system becomes temporarily unavailable during this period. Leaderless algorithms don't have these problems. Given that, these attributes are not clear-cut.

CASPaxos

CASPaxos implements a linearizable rewritable register. Instead of replicating an entire log of commands like Raft and Multi-Paxos do, each operation has a proposer read the current register value, apply an update function locally, and replicate the resulting value to a quorum of acceptors. That makes it a good fit for small-ish blobs of strongly consistent state such as configuration, leases, and membership metadata.

Every update is protected by a ballot number. Before a proposer can overwrite the register, it has to learn the highest value which might still matter, so previously committed work cannot be clobbered blindly.

The protocol is still classic Paxos-shaped: first prepare a ballot, then accept a value at that ballot. The proposer can piggyback the next ballot's prepare onto its successful accept, letting the same proposer stay on a 1 RTT steady-state path.

Role-local state

Proposer state (volatile)

id: unique proposer identifier
round: monotonically increasing, initially 0
prepared: most recently prepared ballot, initially null
cachedVal: last decided value, initially null

Acceptor state (persistent)

promised: highest ballot promised
accepted: last accepted ballot, or null
value: last accepted value, or null

Phase 1: prepare

The proposer picks a fresh ballot b = (round, id) and sends Prepare(b) to acceptors. An acceptor compares b with its promised. If b is high enough, it raises promised to b and replies with Promise(accepted, value). Otherwise it replies with Reject(promised).

To succeed, the proposer needs a quorum of Promise responses. If some acceptor reports a higher ballot via Reject(promised), the proposer does not get to ignore it and move on. It must move round above the highest promised ballot it saw and retry.

Once prepare succeeds, the proposer recovers the latest value which might matter:

if no acceptor in the quorum has accepted anything, then cachedVal = ⊥
otherwise, the proposer sets cachedVal to the value paired with the highest accepted ballot it observed

That recovered cachedVal is the safe input to the client's update function.
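The prepare phase can be sketched roughly as follows. This is a simplified in-memory model, not a reference implementation: ballots are (round, id) tuples compared lexicographically, values are strings, and I am assuming "high enough" means strictly greater than promised. A real acceptor would persist its state before replying.

```csharp
using System.Collections.Generic;

public sealed class Acceptor
{
    public (long Round, int Id) Promised;   // highest ballot promised
    public (long Round, int Id)? Accepted;  // last accepted ballot, or null
    public string Value;                    // last accepted value, or null

    // Returns true for Promise(accepted, value), or false for Reject,
    // in which case the caller reads Promised to pick a higher round.
    public bool Prepare((long Round, int Id) b, out (long Round, int Id)? accepted, out string value)
    {
        accepted = null;
        value = null;
        if (b.CompareTo(Promised) <= 0)
        {
            return false;
        }

        Promised = b;
        accepted = Accepted;
        value = Value;
        return true;
    }
}

public static class Proposer
{
    // Picks the value paired with the highest accepted ballot among a quorum of
    // promises, or null (standing in for ⊥) if no acceptor has accepted anything.
    public static string Recover(IEnumerable<((long Round, int Id)? Accepted, string Value)> promises)
    {
        (long Round, int Id)? highest = null;
        string cachedVal = null;
        foreach (var (accepted, value) in promises)
        {
            if (accepted is null)
            {
                continue;
            }

            if (highest is null || accepted.Value.CompareTo(highest.Value) > 0)
            {
                highest = accepted;
                cachedVal = value;
            }
        }

        return cachedVal;
    }
}
```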

Phase 2: accept

After prepare, the proposer computes newVal = update(cachedVal) and sends Accept(b, newVal, nextBallot) to acceptors.

An acceptor rejects if b is below its promised. Otherwise it:

stores accepted = b
stores value = newVal
raises promised = max(b, nextBallot)

The value is committed once a quorum accepts it.

That optional nextBallot field is the piggybacked-prepare optimization. Instead of doing:

prepare this round
accept this round
prepare again next time

the proposer does:

prepare this round
accept this round and piggyback the next ballot

so the same proposer can cache prepared = nextBallot and skip the next standalone prepare.
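Here is a small sketch of the accept step with the piggybacked next ballot, under the same simplifying assumptions as before: in-memory state, string values, and (round, id) tuple ballots. PiggybackAcceptor and Commit.TryCommit are illustrative names, and a real implementation would send the requests over the network and persist acceptor state.

```csharp
using System.Collections.Generic;

public sealed class PiggybackAcceptor
{
    public (long Round, int Id) Promised;
    public (long Round, int Id)? Accepted;
    public string Value;

    public bool Accept((long Round, int Id) b, string newVal, (long Round, int Id)? nextBallot)
    {
        // Reject if a higher ballot has been promised in the meantime.
        if (b.CompareTo(Promised) < 0)
        {
            return false;
        }

        Accepted = b;
        Value = newVal;

        // promised = max(b, nextBallot): accepting also prepares the proposer's next round.
        Promised = nextBallot.HasValue && nextBallot.Value.CompareTo(b) > 0 ? nextBallot.Value : b;
        return true;
    }
}

public static class Commit
{
    // The value is committed once a majority of acceptors has accepted it.
    public static bool TryCommit(
        IReadOnlyList<PiggybackAcceptor> acceptors,
        (long Round, int Id) b,
        string newVal,
        (long Round, int Id)? nextBallot)
    {
        int acks = 0;
        foreach (PiggybackAcceptor acceptor in acceptors)
        {
            if (acceptor.Accept(b, newVal, nextBallot))
            {
                acks++;
            }
        }

        return acks >= acceptors.Count / 2 + 1;
    }
}
```

After a successful commit at b with nextBallot set, the same proposer can go straight to Accept(nextBallot, ...) for its next update, while a competing proposer using a lower ballot is rejected and has to run prepare.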

<figure class="protocol-cheatsheet"> <div class="protocol-cheatsheet-columns"> <div class="protocol-cheatsheet-column"> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/caspaxos-state.png" data-lightbox data-lightbox-caption="CASPaxos cheatsheet: State" aria-label="Expand CASPaxos cheatsheet state box" > <img src="/images/protocol-cheatsheets/caspaxos-state.png" alt="CASPaxos cheatsheet box titled State" loading="lazy" /> </a> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/caspaxos-rules-for-proposers.png" data-lightbox data-lightbox-caption="CASPaxos cheatsheet: Rules for Proposers" aria-label="Expand CASPaxos cheatsheet rules for proposers box" > <img src="/images/protocol-cheatsheets/caspaxos-rules-for-proposers.png" alt="CASPaxos cheatsheet box titled Rules for Proposers" loading="lazy" /> </a> </div> <div class="protocol-cheatsheet-column"> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/caspaxos-ballots.png" data-lightbox data-lightbox-caption="CASPaxos cheatsheet: Ballots" aria-label="Expand CASPaxos cheatsheet ballots box" > <img src="/images/protocol-cheatsheets/caspaxos-ballots.png" alt="CASPaxos cheatsheet box titled Ballots" loading="lazy" /> </a> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/caspaxos-prepare-rpc.png" data-lightbox data-lightbox-caption="CASPaxos cheatsheet: Prepare RPC" aria-label="Expand CASPaxos cheatsheet prepare RPC box" > <img src="/images/protocol-cheatsheets/caspaxos-prepare-rpc.png" alt="CASPaxos cheatsheet box titled Prepare RPC" loading="lazy" /> </a> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/caspaxos-accept-rpc.png" data-lightbox data-lightbox-caption="CASPaxos cheatsheet: Accept RPC" aria-label="Expand CASPaxos cheatsheet accept RPC box" > <img src="/images/protocol-cheatsheets/caspaxos-accept-rpc.png" alt="CASPaxos cheatsheet box titled Accept RPC" loading="lazy" /> </a> </div> </div> <figcaption> 
CASPaxos cheatsheet </figcaption> </figure>

The important bits: prepare protects prior work by forcing a proposer to learn the value accepted at the highest ballot, which is the one that might still matter; accept is the only step which actually commits a new value; and piggybacking the next ballot does not commit anything by itself, it only prepares the following round.

Fast CASPaxos

Everything above used classic proposer-owned ballots. Fast CASPaxos keeps the same proposer and acceptor roles, the same persistent acceptor state, and the same prepare/accept structure. The core changes relative to CASPaxos are:

ballots with proposerId = 0 are shared fast ballots, so any proposer may use them
a piggybacked next ballot can therefore prepare either the next proposer-owned classic ballot or the next shared fast ballot
prepare still uses a classic majority, but accept at a fast ballot needs a larger fast quorum
within a fast ballot, acceptors are first-write-wins: they accept the first value they see at that ballot and reject later different values
if a later prepare sees that the highest accepted ballot was fast, it tallies the values reported at that ballot and carries forward the one with the most votes, or any one of the values tied for the most

That is enough to turn CASPaxos into Fast CASPaxos without changing acceptor state. If two proposers send different values in the same fast ballot and neither reaches the fast quorum, the protocol falls back to a classic recovery round which applies that tally rule. The cheatsheet and sketch below summarize the core protocol; the later subsections cover practical refinements and optimizations.

Here is the corresponding Fast CASPaxos protocol summary from the paper:

<figure class="protocol-cheatsheet"> <div class="protocol-cheatsheet-columns"> <div class="protocol-cheatsheet-column"> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/fast-caspaxos-state.png" data-lightbox data-lightbox-caption="Fast CASPaxos cheatsheet: State" aria-label="Expand Fast CASPaxos cheatsheet state box" > <img src="/images/protocol-cheatsheets/fast-caspaxos-state.png" alt="Fast CASPaxos cheatsheet box titled State" loading="lazy" /> </a> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/fast-caspaxos-rules-for-proposers.png" data-lightbox data-lightbox-caption="Fast CASPaxos cheatsheet: Rules for Proposers" aria-label="Expand Fast CASPaxos cheatsheet rules for proposers box" > <img src="/images/protocol-cheatsheets/fast-caspaxos-rules-for-proposers.png" alt="Fast CASPaxos cheatsheet box titled Rules for Proposers" loading="lazy" /> </a> </div> <div class="protocol-cheatsheet-column"> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/fast-caspaxos-ballots.png" data-lightbox data-lightbox-caption="Fast CASPaxos cheatsheet: Ballots" aria-label="Expand Fast CASPaxos cheatsheet ballots box" > <img src="/images/protocol-cheatsheets/fast-caspaxos-ballots.png" alt="Fast CASPaxos cheatsheet box titled Ballots" loading="lazy" /> </a> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/fast-caspaxos-prepare-rpc.png" data-lightbox data-lightbox-caption="Fast CASPaxos cheatsheet: Prepare RPC" aria-label="Expand Fast CASPaxos cheatsheet prepare RPC box" > <img src="/images/protocol-cheatsheets/fast-caspaxos-prepare-rpc.png" alt="Fast CASPaxos cheatsheet box titled Prepare RPC" loading="lazy" /> </a> <a class="protocol-cheatsheet-card" href="/images/protocol-cheatsheets/fast-caspaxos-accept-rpc.png" data-lightbox data-lightbox-caption="Fast CASPaxos cheatsheet: Accept RPC" aria-label="Expand Fast CASPaxos cheatsheet accept RPC box" > <img 
src="/images/protocol-cheatsheets/fast-caspaxos-accept-rpc.png" alt="Fast CASPaxos cheatsheet box titled Accept RPC" loading="lazy" /> </a> </div> </div> <figcaption> Fast CASPaxos cheatsheet </figcaption> </figure>

Protocol sketch

This is simplified to show the core protocol only. It omits duplicate-message handling, cached local views, and the best-effort learn path.

Notation:

ballots are (round, proposerId), where proposerId = 0 denotes the shared fast ballot for that round
prepareQuorum = classicQuorum = floor(N / 2) + 1
fastQuorum = ceil(3N / 4)
acceptQuorumFor(ballot) = fastQuorum for fast ballots and classicQuorum for classic ballots
every prepare response returns (acceptedBallot, acceptedValue, maxBallot)
prepare(b) succeeds at an acceptor exactly when the response reports maxBallot = b
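
To make the quorum arithmetic concrete, here is a small Python sketch of the notation above. The function names are mine rather than the paper's, and it assumes ballots are ordered lexicographically as (round, proposerId) tuples, so the fast ballot for a round sorts below every classic ballot of the same round.

```python
from typing import Tuple

Ballot = Tuple[int, int]  # (round, proposer_id); proposer_id == 0 is the shared fast ballot


def classic_quorum(n: int) -> int:
    """Majority quorum: floor(N / 2) + 1."""
    return n // 2 + 1


def fast_quorum(n: int) -> int:
    """Fast quorum: ceil(3N / 4), computed with integer arithmetic."""
    return (3 * n + 3) // 4


def is_fast(ballot: Ballot) -> bool:
    return ballot[1] == 0


def accept_quorum_for(ballot: Ballot, n: int) -> int:
    """Fast ballots need the larger quorum; classic ballots need a majority."""
    return fast_quorum(n) if is_fast(ballot) else classic_quorum(n)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(n, classic_quorum(n), fast_quorum(n))
    # 3 2 3
    # 4 3 3
    # 5 3 4
    # 7 4 6

    # Assumed ordering: tuples compare by round first, then proposer id.
    print((1, 0) < (1, 3) < (2, 0))  # True
```

With five acceptors, a fast accept needs four of them, while prepare and classic accept need only three.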

Proposer

function propose(update):
    b := initial ballot to try

    loop:
        if b is not already prepared:
            responses := collect Prepare(b) responses
            if responses with maxBallot == b do not reach prepareQuorum:
                b := next classic ballot above the highest maxBallot returned
                continue

            currentValue := recoverValue(responses)
        else:
            currentValue := previously learned value for b

        newValue := update(currentValue)
        next := optional next ballot to piggyback

        responses := collect Accept(b, newValue, next) responses
        if accepts in responses reach acceptQuorumFor(b):
            return newValue

        b := next classic ballot above the highest conflicting ballot returned

function recoverValue(responses):
    highest := maximum acceptedBallot reported in the responses

    if highest == ⊥:
        return ⊥

    if highest is a classic ballot:
        return the value paired with highest

    // highest is a fast ballot, so several values may appear at that ballot
    votes := count values among responses whose acceptedBallot == highest
    return any value tied for the largest count

Notes:

prepare always needs only a classic majority. The larger quorum is only for fast accept.
A proposer only reuses next if enough successful accept responses confirm that the promise really landed.
In practice, a proposer only skips prepare when it already has a prepared ballot and enough local knowledge to compute newValue.
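
The recovery rule in recoverValue can be written as runnable Python. This is a minimal sketch under a few assumptions of mine: each prepare response is reduced to an (acceptedBallot, acceptedValue) pair, ⊥ is modelled as None, values are hashable, and ballots compare as (round, proposerId) tuples.

```python
from collections import Counter
from typing import Any, List, Optional, Tuple

Ballot = Tuple[int, int]  # (round, proposer_id); proposer_id == 0 means fast
Response = Tuple[Optional[Ballot], Any]  # (accepted_ballot, accepted_value)


def recover_value(responses: List[Response]) -> Any:
    # Ignore acceptors which have not accepted anything yet.
    accepted = [(b, v) for b, v in responses if b is not None]
    if not accepted:
        return None  # the register is still empty

    highest = max(b for b, _ in accepted)

    if highest[1] != 0:
        # Classic ballot: owned by one proposer, so it carries a single value.
        return next(v for b, v in accepted if b == highest)

    # Fast ballot: several values may appear, so tally them and
    # return one of the values with the largest count.
    votes = Counter(v for b, v in accepted if b == highest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(recover_value([(None, None), (None, None)]))               # None
    print(recover_value([((1, 0), "a"), ((1, 0), "b"), ((1, 0), "a")]))  # a
    print(recover_value([((1, 0), "a"), ((2, 3), "c"), (None, None)]))   # c
```

Only responses at the highest accepted ballot take part in the tally; values from lower ballots are ignored even if they are more numerous.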

Acceptor

state:
    promisedBallot := (1, 0)       // initial fast ballot is implicitly prepared
    acceptedBallot := ⊥
    acceptedValue := ⊥

function onPrepare(b):
    if b >= promisedBallot:
        promisedBallot := b
    reply prepareResponse(acceptedBallot, acceptedValue, promisedBallot)

function onAccept(b, v, next):
    maxBallot := max(promisedBallot, acceptedBallot)
    if b < maxBallot:
        reply reject(maxBallot)
        return

    // Optimization: idempotency from possibly multiple proposers
    if b == acceptedBallot and v == acceptedValue:
        reply accept(promisedBallot)
        return

    if b == acceptedBallot and v != acceptedValue:
        reply reject(maxBallot)
        return

    acceptedBallot := b
    acceptedValue := v

    if next != none:
        promisedBallot := max(promisedBallot, next)

    reply accept(promisedBallot)
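
For readers who prefer something executable, here is the acceptor sketch transliterated into Python. It is an in-memory toy, not the implementation from the repository: replies are modelled as plain tuples, ⊥ as None, persistence is omitted, and ballots are assumed to compare as (round, proposerId) tuples.

```python
from typing import Any, Optional, Tuple

Ballot = Tuple[int, int]  # (round, proposer_id); proposer_id == 0 means fast


class Acceptor:
    def __init__(self) -> None:
        self.promised: Ballot = (1, 0)  # initial fast ballot is implicitly prepared
        self.accepted_ballot: Optional[Ballot] = None
        self.accepted_value: Any = None

    def on_prepare(self, b: Ballot):
        if b >= self.promised:
            self.promised = b
        return (self.accepted_ballot, self.accepted_value, self.promised)

    def on_accept(self, b: Ballot, v: Any, nxt: Optional[Ballot] = None):
        max_ballot = self.promised
        if self.accepted_ballot is not None:
            max_ballot = max(max_ballot, self.accepted_ballot)
        if b < max_ballot:
            return ("reject", max_ballot)

        if b == self.accepted_ballot:
            if v == self.accepted_value:
                # Same ballot, same value: acknowledge without changing state.
                return ("accept", self.promised)
            # Same ballot, different value: first write wins.
            return ("reject", max_ballot)

        self.accepted_ballot = b
        self.accepted_value = v
        if nxt is not None:
            self.promised = max(self.promised, nxt)
        return ("accept", self.promised)


if __name__ == "__main__":
    a = Acceptor()
    print(a.on_accept((1, 0), "x"))          # ('accept', (1, 0))
    print(a.on_accept((1, 0), "y"))          # ('reject', (1, 0))
    print(a.on_accept((1, 0), "x"))          # ('accept', (1, 0))
    print(a.on_accept((1, 2), "z", (2, 0)))  # ('accept', (2, 0))
    print(a.on_accept((1, 3), "w"))          # ('reject', (2, 0))
```

The second call shows first-write-wins at the fast ballot, the third shows the idempotent acknowledgement, and the last two show a piggybacked next ballot shutting out a later classic ballot from the earlier round.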

Concurrently committing identical values

First-write-wins only rules out different values at the same fast ballot. To avoid treating identical concurrent proposals as conflicts, acceptors still acknowledge an accept request which exactly matches their current accepted ballot and value. That keeps almost-everywhere-agreement workloads efficient without changing state.

In classic rounds, the same behavior also makes accept retries idempotent.

Register initialization

To let proposers initialize a register in 1 RTT, they start with the accept phase for the initial fast round, (1, 0). If there is no contention, their initial value is committed; otherwise they learn of the conflicting ballot and retry from the prepare phase. This works because acceptors are initialized with a promised ballot of (1, 0).

Reusing a piggybacked ballot safely

Piggybacking a next ballot is only a latency optimization. A proposer may skip a standalone prepare on its next operation only if a quorum of successful accept responses confirms that the requested next ballot was actually promised. If only some acceptors echoed it, the proposer discards that candidate ballot and falls back to prepare.
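
The reuse check can be sketched in a few lines of Python. The response shape and the function name are hypothetical: I model each accept response as (ok, promisedBallot), mirroring the accept(promisedBallot) reply in the acceptor sketch, and leave the quorum size as a parameter for the caller to supply.

```python
from typing import Iterable, Tuple

Ballot = Tuple[int, int]
AcceptResponse = Tuple[bool, Ballot]  # (ok, promised_ballot echoed by the acceptor)


def can_reuse_next(responses: Iterable[AcceptResponse],
                   requested_next: Ballot,
                   quorum: int) -> bool:
    """True only if enough successful accepts echoed exactly the requested ballot."""
    confirmed = sum(
        1 for ok, promised in responses if ok and promised == requested_next
    )
    return confirmed >= quorum


if __name__ == "__main__":
    responses = [
        (True, (2, 0)),   # accepted and promised the requested next ballot
        (True, (2, 0)),
        (True, (3, 1)),   # accepted, but already promised something higher
        (False, (3, 1)),  # rejected
    ]
    print(can_reuse_next(responses, (2, 0), quorum=3))  # False
    print(can_reuse_next(responses, (2, 0), quorum=2))  # True
```

An acceptor which had already promised a higher ballot echoes that higher ballot instead, so its response does not count towards the confirmation.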

Learning decided values

A proposer may send Accept(b, v, next) only if it already knows the register value for b. It can learn that value by running prepare, by committing and reusing the piggybacked next ballot, or via a best-effort learn notification from another proposer.

Learn traffic is only an optimization. If that notification is stale or missing, the proposer falls back to prepare.

Do we even need classic rounds?

Fast CASPaxos could be generalized further to use only fast rounds and never fall back to classic rounds. Distributed consensus revised contains the necessary relaxations/generalizations. On the other hand, having distinct leadered and leaderless rounds confers higher fault tolerance, and it's arguably less of a leap for people already familiar with Classic Paxos.

Conclusion

Fast CASPaxos is a small extension to CASPaxos that implements a leaderless linearizable register. It's conceptually a blend of Fast Paxos and CASPaxos. It's likely most useful for consistent group membership (e.g., Rapid) and metadata replication (e.g., Delos and other Consensus for Metadata systems, of which there are a lot, btw). I'm happy to have scratched this itch that has been bugging me for a long time.

I uploaded a draft PDF of a paper on Fast CASPaxos. The accompanying repository also includes a TLA+ model checked with TLC, a deterministic simulation suite with Porcupine linearizability checking, and some toy benchmark workloads for various scenarios and topologies. When reading the code, note that while I spent time on the core (proposer, acceptor, types, etc.), a lot of what surrounds it was generated by coding agents, especially the benchmark harness.
