<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/rss.xsl" type="text/xsl"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>reublog</title><description>Notes on distributed systems, .NET, and software.</description><link>https://reubenbond.github.io</link><item><title>Deploying Wyam To GitHub Using Visual Studio Online</title><link>https://reubenbond.github.io/posts/-setting-up-wyam</link><guid isPermaLink="true">https://reubenbond.github.io/posts/-setting-up-wyam</guid><description>Here goes nothing! This blog is built with Dave Glick&apos;s Wyam static site generator and deployed from a git repo in Visual Studio Online to GitHub Pages. Here&apos;s how to set up something similar.</description><pubDate>Tue, 03 Oct 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Here goes nothing! This blog is built with &lt;a href=&quot;https://twitter.com/daveaglick&quot;&gt;Dave Glick&apos;s&lt;/a&gt; &lt;a href=&quot;https://wyam.io/&quot;&gt;Wyam&lt;/a&gt; static site generator and deployed from a git repo in Visual Studio Online to GitHub Pages. Here&apos;s how to set up something similar.&lt;/p&gt;
&lt;h1&gt;Prerequisites&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;A Visual Studio Online repository for your blog source.
&lt;ul&gt;
&lt;li&gt;You could also have VSO pull the source from GitHub or somewhere else instead, but I haven&apos;t covered that here.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A GitHub repository which will serve the compiled output via GitHub Pages.
&lt;ul&gt;
&lt;li&gt;I created a repository called &lt;a href=&quot;https://github.com/ReubenBond/reubenbond.github.io&quot;&gt;&lt;code&gt;reubenbond.github.io&lt;/code&gt;&lt;/a&gt; under my profile, &lt;a href=&quot;https://github.com/ReubenBond/&quot;&gt;&lt;code&gt;ReubenBond&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cake so you can test it out locally. Install it via &lt;a href=&quot;https://chocolatey.org/&quot;&gt;Chocolatey&lt;/a&gt;: &lt;code&gt;choco install cake.portable&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Kick-starting Wyam with Cake&lt;/h1&gt;
&lt;p&gt;Create a file called &lt;code&gt;build.cake&lt;/code&gt; in the root of your repo with these contents:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#tool nuget:?package=Wyam
#addin nuget:?package=Cake.Wyam

var target = Argument(&quot;target&quot;, &quot;Default&quot;);

Task(&quot;Build&quot;)
   .Does(() =&amp;gt;
   {
       Wyam(new WyamSettings
       {
           Recipe = &quot;Blog&quot;,
           Theme = &quot;CleanBlog&quot;,
           UpdatePackages = true
       });
   });
   
Task(&quot;Preview&quot;)
   .Does(() =&amp;gt;
   {
       Wyam(new WyamSettings
       {
           Recipe = &quot;Blog&quot;,
           Theme = &quot;CleanBlog&quot;,
           UpdatePackages = true,
           Preview = true,
           Watch = true
       });        
   });

Task(&quot;Default&quot;)
   .IsDependentOn(&quot;Build&quot;);    
   
RunTarget(target);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add a file called &lt;code&gt;config.wyam&lt;/code&gt; like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#recipe Blog
#theme CleanBlog

Settings[Keys.Host] = &quot;yourname.github.io&quot;;
Settings[BlogKeys.Title] = &quot;MegaBlog&quot;;
Settings[BlogKeys.Description] = &quot;Blog of the Gods&quot;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a folder called &lt;code&gt;input&lt;/code&gt; and add a folder called &lt;code&gt;posts&lt;/code&gt; inside that.
Now create &lt;code&gt;input/posts/fist-post.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Title: Fist Post! A song of fice and ire
Published: 10/30/2017
Tags: [&apos;Fists&apos;]
---

This post is about fists and how clumpy they always are.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Great! Try running it using Cake. Because Wyam targets an older version of Cake at the time of writing, I&apos;m adding the &lt;code&gt;--settings_skipverification=true&lt;/code&gt; option so that Cake doesn&apos;t complain.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cake --settings_skipverification=true -target=Preview
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open a browser to http://localhost:5080 and see the results. The &lt;code&gt;Preview&lt;/code&gt; target watches for file changes so it can automatically recompile &amp;amp; refresh your browser whenever you save changes.&lt;/p&gt;
&lt;h1&gt;Automating Deployment&lt;/h1&gt;
&lt;ol&gt;
&lt;li&gt;Install the &lt;a href=&quot;https://marketplace.visualstudio.com/items?itemName=cake-build.cake&quot;&gt;Cake build task from the Visual Studio Marketplace&lt;/a&gt; into VSO.&lt;/li&gt;
&lt;li&gt;In Visual Studio Online, create a new, empty build for your repo, selecting an appropriate build agent.&lt;/li&gt;
&lt;li&gt;Add the Cake Build task.&lt;/li&gt;
&lt;li&gt;Select the &lt;code&gt;build.cake&lt;/code&gt; file from the root of your repo as the &lt;em&gt;Cake Script&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Set the &lt;em&gt;Target&lt;/em&gt; to &lt;code&gt;Default&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Optionally add the &lt;code&gt;--settings_skipverification=true&lt;/code&gt; option to &lt;em&gt;Cake Arguments&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Add a new &lt;em&gt;PowerShell Script&lt;/em&gt; build task, set &lt;em&gt;Type&lt;/em&gt; to &lt;code&gt;Inline Script&lt;/code&gt; and add these contents:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;param (
  [string]$Token,
  [string]$UserName,
  [string]$Repository
)

$localFolder = &quot;gh-pages&quot;
$repo = &quot;https://$($UserName):$($Token)@github.com/$($Repository).git&quot;
git clone $repo --branch=master $localFolder

Copy-Item &quot;output\*&quot; $localFolder -recurse

Set-Location $localFolder
git add *
git commit -m &quot;Update.&quot;
git push
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;8&quot;&gt;
&lt;li&gt;Create a new GitHub Personal Access token from GitHub&apos;s Developer Settings page, or by &lt;a href=&quot;https://github.com/settings/tokens/new&quot;&gt;clicking here&lt;/a&gt;. I added all of the &lt;code&gt;repo&lt;/code&gt; permissions to the token.&lt;/li&gt;
&lt;li&gt;In VSO, add arguments for the script, replacing &lt;code&gt;TOKEN&lt;/code&gt; with your token and replacing the other values as appropriate:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;-Token TOKEN -UserName &quot;ReubenBond&quot; -Repository &quot;ReubenBond/reubenbond.github.io&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;10&quot;&gt;
&lt;li&gt;Up on the &lt;em&gt;Triggers&lt;/em&gt; pane, enable Continuous Integration.&lt;/li&gt;
&lt;li&gt;Click &lt;em&gt;Save &amp;amp; queue&lt;/em&gt;, then cross your fingers.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Hopefully that&apos;s it and you can now add new blog posts to the &lt;code&gt;input/posts&lt;/code&gt; directory.&lt;/p&gt;
</content:encoded><author>Reuben Bond</author></item><item><title>Code Generation on .NET</title><link>https://reubenbond.github.io/posts/codegen-1</link><guid isPermaLink="true">https://reubenbond.github.io/posts/codegen-1</guid><description>A brief overview of code generation APIs in .NET</description><pubDate>Wed, 01 Nov 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;This is the first part in what&apos;s hopefully a series of short posts covering code generation on the .NET platform.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Almost every .NET application relies on code generation in some form, usually because it depends on a library which generates code as part of how it functions. For example, Json.NET &lt;a href=&quot;https://github.com/JamesNK/Newtonsoft.Json/blob/473a7721bd67cca8fef1ecc37da1951a1c180022/Src/Newtonsoft.Json/Utilities/DynamicReflectionDelegateFactory.cs&quot;&gt;leverages code generation&lt;/a&gt;, and so do &lt;a href=&quot;https://github.com/aspnet/MvcPrecompilation&quot;&gt;ASP.NET&lt;/a&gt;, Entity Framework, &lt;a href=&quot;https://github.com/dotnet/orleans&quot;&gt;Orleans&lt;/a&gt;, most serialization libraries, many dependency injection libraries, and probably every test mocking library.&lt;/p&gt;
&lt;p&gt;Let&apos;s skip past &lt;em&gt;why&lt;/em&gt; code generation is useful and jump straight into a high level overview of code generation technologies for .NET.&lt;/p&gt;
&lt;h2&gt;Kinds of Code Generation&lt;/h2&gt;
&lt;p&gt;The 3 code gen methods for .NET which we&apos;ll discuss are: &lt;strong&gt;Expression Trees&lt;/strong&gt;, &lt;strong&gt;IL Generation&lt;/strong&gt;, and &lt;strong&gt;Syntax Generation&lt;/strong&gt;. There are other methods, such as text templating (eg using T4). Here are the pros and cons of each as I see them.&lt;/p&gt;
&lt;h3&gt;Expression Trees&lt;/h3&gt;
&lt;p&gt;Using &lt;strong&gt;LINQ Expression Trees&lt;/strong&gt; to compile expressions at runtime.&lt;/p&gt;
&lt;p&gt;:::tip
Easy to use, expressive, and often the most approachable place to start when you need runtime code generation.
:::&lt;/p&gt;
&lt;p&gt;:::caution
Expression trees are interpreted on AOT-only platforms like iOS, and some language constructs simply are not available.
:::&lt;/p&gt;
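&lt;p&gt;As a rough illustration (a minimal sketch, not taken from any of the libraries mentioned above), this builds and compiles the expression &lt;code&gt;x =&amp;gt; x * x + 1&lt;/code&gt; at runtime. The class name &lt;code&gt;ExpressionTreeSketch&lt;/code&gt; is made up for the example.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Linq.Expressions;

public static class ExpressionTreeSketch
{
    public static void Main()
    {
        // Parameter: int x
        var x = Expression.Parameter(typeof(int), &quot;x&quot;);

        // Body: x * x + 1
        var body = Expression.Add(
            Expression.Multiply(x, x),
            Expression.Constant(1));

        // Compile the tree into a delegate which can be invoked like any other.
        var compiled = Expression.Lambda&amp;lt;Func&amp;lt;int, int&amp;gt;&amp;gt;(body, x).Compile();

        Console.WriteLine(compiled(4)); // 4 * 4 + 1 = 17
    }
}
&lt;/code&gt;&lt;/pre&gt;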
&lt;h3&gt;IL Generation&lt;/h3&gt;
&lt;p&gt;Using &lt;strong&gt;Reflection.Emit&lt;/strong&gt; to dynamically create types and methods using Common Intermediate Language (known as CIL or just IL), which is the assembly language of the CLR.&lt;/p&gt;
&lt;p&gt;:::tip
IL generation can produce code which cannot be expressed in C#, such as direct access to private members.
:::&lt;/p&gt;
&lt;p&gt;:::warning
The trade-off is ergonomics: IL is verbose, awkward to debug, difficult to use for higher-level features like &lt;code&gt;async&lt;/code&gt;/&lt;code&gt;await&lt;/code&gt;, and unavailable on AOT-only platforms.
:::&lt;/p&gt;
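&lt;p&gt;To make the private member point concrete, here is a small sketch which emits a getter for a private field. &lt;code&gt;Account&lt;/code&gt; and &lt;code&gt;PrivateFieldSketch&lt;/code&gt; are hypothetical types invented for the example; the only real APIs used are &lt;code&gt;DynamicMethod&lt;/code&gt; and &lt;code&gt;ILGenerator&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical type with a private field which C# code outside it cannot read.
public sealed class Account
{
    private int balance = 42;
}

public static class PrivateFieldSketch
{
    public static void Main()
    {
        var field = typeof(Account).GetField(
            &quot;balance&quot;,
            BindingFlags.Instance | BindingFlags.NonPublic);

        // The final argument (skipVisibility: true) tells the runtime
        // to skip accessibility checks for the generated method.
        var method = new DynamicMethod(
            &quot;GetBalance&quot;,
            typeof(int),
            new[] { typeof(Account) },
            typeof(Account).Module,
            true);

        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);      // account
        il.Emit(OpCodes.Ldfld, field); // account.balance
        il.Emit(OpCodes.Ret);

        var getBalance = (Func&amp;lt;Account, int&amp;gt;)method.CreateDelegate(typeof(Func&amp;lt;Account, int&amp;gt;));
        Console.WriteLine(getBalance(new Account())); // 42
    }
}
&lt;/code&gt;&lt;/pre&gt;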
&lt;h3&gt;Syntax Generation&lt;/h3&gt;
&lt;p&gt;Using &lt;strong&gt;Roslyn&lt;/strong&gt; or some other API to generate C# syntax trees or source code and compile it either at runtime or when the target project is built.&lt;/p&gt;
&lt;p&gt;:::tip
Syntax generation gives you direct access to the full C# language and works well on AOT-only platforms because the output is plain source code.
:::&lt;/p&gt;
&lt;p&gt;:::note
The API can feel indirect because it was designed for parsing and compilation first, not authoring, and runtime scenarios mean shipping Roslyn with your app.
:::&lt;/p&gt;
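&lt;p&gt;For a feel of the API, this sketch uses Roslyn&apos;s &lt;code&gt;SyntaxFactory&lt;/code&gt; to build a tiny class and print it as C# source. It assumes a reference to the &lt;code&gt;Microsoft.CodeAnalysis.CSharp&lt;/code&gt; NuGet package; the names &lt;code&gt;Generated&lt;/code&gt;, &lt;code&gt;Answer&lt;/code&gt; and &lt;code&gt;SyntaxSketch&lt;/code&gt; are made up.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using static Microsoft.CodeAnalysis.CSharp.SyntaxFactory;

public static class SyntaxSketch
{
    public static void Main()
    {
        // public static int Answer() { return 42; }
        var method = MethodDeclaration(PredefinedType(Token(SyntaxKind.IntKeyword)), &quot;Answer&quot;)
            .AddModifiers(Token(SyntaxKind.PublicKeyword), Token(SyntaxKind.StaticKeyword))
            .WithBody(Block(
                ReturnStatement(
                    LiteralExpression(SyntaxKind.NumericLiteralExpression, Literal(42)))));

        // public static class Generated { ... }
        var generated = ClassDeclaration(&quot;Generated&quot;)
            .AddModifiers(Token(SyntaxKind.PublicKeyword), Token(SyntaxKind.StaticKeyword))
            .AddMembers(method);

        // NormalizeWhitespace inserts the whitespace needed to make the tree readable.
        Console.WriteLine(generated.NormalizeWhitespace().ToFullString());
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even this trivial method takes several nested factory calls, which is the indirection mentioned above.&lt;/p&gt;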
&lt;h2&gt;Orleans&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/dotnet/orleans&quot;&gt;Microsoft Orleans&lt;/a&gt; uses the latter two approaches: IL and Roslyn. It uses Roslyn wherever possible, since it allows for easy access to C# language features like &lt;code&gt;async&lt;/code&gt; and since it&apos;s easy to comprehend both the code generator and the generated code. Otherwise, IL generation is used for two things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Generating code at runtime. For example &lt;a href=&quot;https://github.com/dotnet/orleans/blob/375a98191ca40c27ca8ed61199a6a77a7995e75e/src/Orleans.Core/Serialization/ILSerializerGenerator.cs&quot;&gt;&lt;code&gt;ILSerializerGenerator&lt;/code&gt;&lt;/a&gt; generates serializers as a last resort for types which C# serializers couldn&apos;t be generated for (for example, private inner classes). It&apos;s a faster and less restricted alternative to .NET&apos;s &lt;a href=&quot;https://msdn.microsoft.com/en-us/library/system.runtime.serialization.formatters.binary.binaryformatter(v=vs.110).aspx&quot;&gt;&lt;code&gt;BinaryFormatter&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Producing code which cannot be expressed in C#. For example, &lt;a href=&quot;https://github.com/dotnet/orleans/blob/375a98191ca40c27ca8ed61199a6a77a7995e75e/src/Orleans.Core.Abstractions/Serialization/FieldUtils.cs#&quot;&gt;&lt;code&gt;FieldUtils&lt;/code&gt;&lt;/a&gt; provides access to private fields and methods for serialization.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;General Strategy&lt;/h2&gt;
&lt;p&gt;Regardless of which technology a library makes use of, code generation typically involves two phases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Metadata Collection
&lt;ul&gt;
&lt;li&gt;The code generator takes some input and creates an abstract representation of it in order to drive the code synthesis process.&lt;/li&gt;
&lt;li&gt;Eg, a library for deeply cloning objects might take a &lt;code&gt;Type&lt;/code&gt; as input and generate an object describing each field in that type.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Code Synthesis
&lt;ul&gt;
&lt;li&gt;The code generator uses the metadata model to drive the process of actually generating code (LINQ expressions, IL instructions, syntax tree nodes).&lt;/li&gt;
&lt;li&gt;Eg, our deep cloning library will generate a method which takes an object of the specified type from the metadata model and then recursively copies each of its fields.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
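&lt;p&gt;The two phases can be sketched in a few lines. This is a deliberately simplified illustration: the metadata model only records public fields, and the synthesis phase produces plain C# text for a shallow, field-by-field copy rather than a recursive one. All type and method names here (&lt;code&gt;CopierModel&lt;/code&gt;, &lt;code&gt;Collect&lt;/code&gt;, &lt;code&gt;Synthesize&lt;/code&gt;, &lt;code&gt;Point&lt;/code&gt;) are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Reflection;
using System.Text;

// Phase 1 output: an abstract description of the type to be copied.
public sealed class CopierModel
{
    public CopierModel(Type type, FieldInfo[] fields)
    {
        Type = type;
        Fields = fields;
    }

    public Type Type { get; }
    public FieldInfo[] Fields { get; }
}

// Hypothetical input type.
public sealed class Point
{
    public int X;
    public int Y;
}

public static class TwoPhaseSketch
{
    // Phase 1: metadata collection.
    public static CopierModel Collect(Type type) =&amp;gt;
        new CopierModel(type, type.GetFields(BindingFlags.Instance | BindingFlags.Public));

    // Phase 2: code synthesis, driven only by the model.
    public static string Synthesize(CopierModel model)
    {
        var name = model.Type.Name;
        var code = new StringBuilder();
        code.AppendLine($&quot;public static {name} Copy({name} original)&quot;);
        code.AppendLine(&quot;{&quot;);
        code.AppendLine($&quot;    var result = new {name}();&quot;);
        foreach (var field in model.Fields)
        {
            code.AppendLine($&quot;    result.{field.Name} = original.{field.Name};&quot;);
        }

        code.AppendLine(&quot;    return result;&quot;);
        code.AppendLine(&quot;}&quot;);
        return code.ToString();
    }

    public static void Main() =&amp;gt; Console.WriteLine(Synthesize(Collect(typeof(Point))));
}
&lt;/code&gt;&lt;/pre&gt;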
&lt;p&gt;The two phases can be merged for simple code generators. Orleans uses two phases. In phase 1, the input assembly is scanned and metadata is collected for types matching various criteria: Grain classes, Grain interfaces, serializable types, and custom serializer registrations. In phase 2, support classes are generated. For example, each grain interface has two classes generated: an RPC proxy and an RPC stub.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;That&apos;s enough for now. Maybe next time we&apos;ll take a look at writing that hypothetical deep cloning library using IL generation. After that, we can take a look at a serialization library I&apos;ve been working on which uses Roslyn for both metadata collection and syntax generation. If either of those things is interesting to you, let me know here or on &lt;a href=&quot;https://twitter.com/reubenbond&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;:::important
If IL generation is the piece you want to see in practice, the next post walks through a deep-copy implementation step by step.
:::&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/posts/codegen-2-il-boogaloo&quot;&gt;&lt;strong&gt;Next Post: .NET IL Generation - Writing DeepCopy&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Reuben Bond</author></item><item><title>.NET IL Generation - Writing DeepCopy</title><link>https://reubenbond.github.io/posts/codegen-2-il-boogaloo</link><guid isPermaLink="true">https://reubenbond.github.io/posts/codegen-2-il-boogaloo</guid><description>Implementing a powerful object cloning library using IL generation.</description><pubDate>Sat, 04 Nov 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;This is the second part in a series of short posts covering code generation on the .NET platform.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;IL Generation&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;/posts/codegen-1&quot;&gt;Last time&lt;/a&gt;, we skimmed over some methods to generate code on .NET and one of them was emitting IL. IL generation lets us circumvent the rules C# and other languages put in place to protect us from our own stupidity. Without rules like “don&apos;t access private members of foreign types” and “don&apos;t modify &lt;code&gt;readonly&lt;/code&gt; fields”, we can implement all kinds of fancy foot guns. That last one is interesting: C#&apos;s &lt;code&gt;readonly&lt;/code&gt; translates into &lt;code&gt;initonly&lt;/code&gt; on the IL/metadata level so theoretically we shouldn&apos;t be able to modify those fields even using IL. As a matter of fact we can, but it comes at a cost: &lt;strong&gt;our IL will no longer be verifiable&lt;/strong&gt;. That means that certain tools will bark at you if you try to write IL code which commits this sin, tools such as &lt;a href=&quot;https://docs.microsoft.com/en-us/dotnet/framework/tools/peverify-exe-peverify-tool&quot;&gt;PEVerify&lt;/a&gt; and &lt;a href=&quot;https://github.com/dotnet/corert/tree/master/src/ILVerify&quot;&gt;ILVerify&lt;/a&gt;. Verifiability also has ramifications for &lt;a href=&quot;https://docs.microsoft.com/en-us/dotnet/framework/misc/security-transparent-code&quot;&gt;Security-Transparent Code&lt;/a&gt;. Thankfully for us, Code Access Security and Security-Transparent Code &lt;a href=&quot;https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/porting.md#code-access-security-cas&quot;&gt;don&apos;t exist in .NET Core&lt;/a&gt; and they usually don&apos;t cause issues for .NET Framework.&lt;/p&gt;
&lt;p&gt;Enough stalling, onto our mission briefing.&lt;/p&gt;
&lt;h3&gt;DeepCopy&lt;/h3&gt;
&lt;p&gt;Today we&apos;re going to implement the guts of a library for creating deep copies of objects. Essentially it provides one method:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public static T Copy&amp;lt;T&amp;gt;(T original);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our library will be called &lt;em&gt;DeepCopy&lt;/em&gt; and the source is up on GitHub at &lt;a href=&quot;https://github.com/ReubenBond/DeepCopy&quot;&gt;ReubenBond/DeepCopy&lt;/a&gt;; feel free to mess about with it. The majority of the code was adapted from the &lt;a href=&quot;https://github.com/dotnet/orleans&quot;&gt;Orleans&lt;/a&gt; codebase.&lt;/p&gt;
&lt;p&gt;Deep copying is important for frameworks such as &lt;a href=&quot;https://github.com/dotnet/orleans&quot;&gt;Orleans&lt;/a&gt;, since it allows us to safely send mutable objects between grains on the same node without having to first serialize and then deserialize them, among other things. Of course, immutable objects (such as strings) are shared without copying. Oddly enough, serializing then deserializing an object is the &lt;a href=&quot;https://stackoverflow.com/a/78612/635314&quot;&gt;accepted Stack Overflow answer&lt;/a&gt; to the question of “how can I deep copy an object?”.&lt;/p&gt;
&lt;p&gt;Let&apos;s see if we can fix that.&lt;/p&gt;
&lt;h3&gt;Battle Plan&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;Copy&lt;/code&gt; method will recursively copy every field in the input object into a new instance of the same type. It must be able to deal with multiple references to the same object, so that if the user provides an object which contains a reference to itself then the result will also contain a reference to itself. That means we&apos;ll need to perform reference tracking. That&apos;s easy to do: we maintain a &lt;code&gt;Dictionary&amp;lt;object, object&amp;gt;&lt;/code&gt; which maps from original object to copy object. Our main &lt;code&gt;Copy&amp;lt;T&amp;gt;(T orig)&lt;/code&gt; method will call into a helper method with that dictionary as a parameter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public static T Copy&amp;lt;T&amp;gt;(T original, CopyContext context)
{
  /* TODO: implementation */
}
&lt;/code&gt;&lt;/pre&gt;
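&lt;p&gt;For orientation, a context type along these lines would be enough for reference tracking. This is a guess at the shape rather than the library&apos;s actual class: only &lt;code&gt;RecordObject&lt;/code&gt; is named in this post, while &lt;code&gt;TryGetCopy&lt;/code&gt; and &lt;code&gt;ReferenceComparer&lt;/code&gt; are hypothetical names. The dictionary compares keys by reference so that two distinct objects which happen to be &lt;code&gt;Equals&lt;/code&gt; are still tracked separately.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System.Collections.Generic;
using System.Runtime.CompilerServices;

public sealed class CopyContextSketch
{
    // Maps each original object to its copy, keyed by reference identity.
    private readonly Dictionary&amp;lt;object, object&amp;gt; copies =
        new Dictionary&amp;lt;object, object&amp;gt;(ReferenceComparer.Instance);

    // Called as soon as a copy has been allocated, before its fields are filled in,
    // so that cycles in the object graph resolve to the in-progress copy.
    public void RecordObject(object original, object copy) =&amp;gt; copies[original] = copy;

    // Hypothetical lookup used to short-circuit already-copied objects.
    public bool TryGetCopy(object original, out object copy) =&amp;gt;
        copies.TryGetValue(original, out copy);

    private sealed class ReferenceComparer : IEqualityComparer&amp;lt;object&amp;gt;
    {
        public static readonly ReferenceComparer Instance = new ReferenceComparer();

        bool IEqualityComparer&amp;lt;object&amp;gt;.Equals(object x, object y) =&amp;gt; ReferenceEquals(x, y);

        int IEqualityComparer&amp;lt;object&amp;gt;.GetHashCode(object obj) =&amp;gt; RuntimeHelpers.GetHashCode(obj);
    }
}
&lt;/code&gt;&lt;/pre&gt;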
&lt;p&gt;The copy routine is roughly as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the input is &lt;code&gt;null&lt;/code&gt;, return &lt;code&gt;null&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the input has already been copied (or is currently being copied), return its copy.&lt;/li&gt;
&lt;li&gt;If the input is &apos;immutable&apos;, return the input.&lt;/li&gt;
&lt;li&gt;If the input is an array, copy each element into a new array and return it.&lt;/li&gt;
&lt;li&gt;Create a new instance of the input type and recursively copy each field from the input to the output and return it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our definition of immutable is simple: the type is either a primitive or it&apos;s marked using a special &lt;code&gt;[Immutable]&lt;/code&gt; attribute. More elaborate immutability checks could probably be soundly implemented, so &lt;a href=&quot;https://github.com/ReubenBond/DeepCopy/pull/new/master&quot;&gt;submit a PR&lt;/a&gt; if you&apos;ve improved upon it.&lt;/p&gt;
&lt;p&gt;Everything but the last step in our routine is simple enough to do without generating code. The last step, recursively copying each field, can be performed using reflection to get and set field values. Reflection is a real performance killer on the hot path, though, and so we&apos;re going to go our own route using IL.&lt;/p&gt;
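&lt;p&gt;For comparison, here is roughly what the whole routine looks like when the last step is done with plain reflection. Treat it as a sketch under a few assumptions: immutability is simplified to primitives and strings (no &lt;code&gt;[Immutable]&lt;/code&gt; attribute), only single-dimensional arrays are handled, and &lt;code&gt;ReferenceEqualityComparer&lt;/code&gt; requires .NET 5 or later. Recent .NET versions may also report &lt;code&gt;FormatterServices&lt;/code&gt; as obsolete. The class name &lt;code&gt;ReflectionCopierSketch&lt;/code&gt; is made up.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.Serialization;

public static class ReflectionCopierSketch
{
    public static T Copy&amp;lt;T&amp;gt;(T original) =&amp;gt;
        (T)CopyInner(original, new Dictionary&amp;lt;object, object&amp;gt;(ReferenceEqualityComparer.Instance));

    private static object CopyInner(object original, Dictionary&amp;lt;object, object&amp;gt; copies)
    {
        // Null stays null.
        if (original == null) return null;

        // Already copied, or currently being copied: reuse that copy.
        if (copies.TryGetValue(original, out var existing)) return existing;

        // Immutable inputs are shared (simplified to primitives and strings).
        var type = original.GetType();
        if (type.IsPrimitive || type == typeof(string)) return original;

        // Arrays: copy element by element (single-dimensional only).
        if (original is Array array)
        {
            var arrayCopy = Array.CreateInstance(type.GetElementType(), array.Length);
            copies[original] = arrayCopy;
            for (var i = 0; i &amp;lt; array.Length; i++)
            {
                arrayCopy.SetValue(CopyInner(array.GetValue(i), copies), i);
            }

            return arrayCopy;
        }

        // Everything else: allocate without running a constructor, record the
        // copy before recursing so cycles resolve, then copy every field.
        var result = FormatterServices.GetUninitializedObject(type);
        copies[original] = result;
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public
            | BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        for (var current = type; current != null; current = current.BaseType)
        {
            foreach (var field in current.GetFields(flags))
            {
                field.SetValue(result, CopyInner(field.GetValue(original), copies));
            }
        }

        return result;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every &lt;code&gt;GetValue&lt;/code&gt; and &lt;code&gt;SetValue&lt;/code&gt; call here boxes values and goes through reflection on each copy, which is the overhead that generated IL avoids.&lt;/p&gt;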
&lt;h3&gt;Diving Into The Code&lt;/h3&gt;
&lt;p&gt;The main IL generation in &lt;em&gt;DeepCopy&lt;/em&gt; occurs inside &lt;a href=&quot;https://github.com/ReubenBond/DeepCopy/blob/1b00515b6b6aece93b4bea61bf40780265c2e349/src/DeepCopy/CopierGenerator.cs#L52&quot;&gt;&lt;code&gt;CopierGenerator.cs&lt;/code&gt;&lt;/a&gt; in the &lt;code&gt;CreateCopier&amp;lt;T&amp;gt;(Type type)&lt;/code&gt; method. Let&apos;s walk through it:&lt;/p&gt;
&lt;p&gt;First we create a new &lt;code&gt;DynamicMethod&lt;/code&gt; which will hold the IL code we emit. We have to tell &lt;code&gt;DynamicMethod&lt;/code&gt; the signature of the method we&apos;re creating. In our case, it matches a generic delegate type, &lt;code&gt;delegate T DeepCopyDelegate&amp;lt;T&amp;gt;(T original, CopyContext context)&lt;/code&gt;. Then we get the &lt;code&gt;ILGenerator&lt;/code&gt; for the method so that we can begin emitting IL code to it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;var dynamicMethod = new DynamicMethod(
    type.Name + &quot;DeepCopier&quot;,
    typeof(T), // The return type of the delegate
    new[] {typeof(T), typeof(CopyContext)}, // The parameter types of the delegate.
    typeof(CopierGenerator).Module,
    true);

var il = dynamicMethod.GetILGenerator(); 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The IL is going to be rather complicated because it needs to deal with immutable types and value types, but let&apos;s walk through it bit-by-bit.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Declare a variable to store the result.
il.DeclareLocal(type);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next we need to initialize our new local variable to a new instance of the input type. There are 3 cases to consider, each corresponding to a block in the following code:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The type is a value type (struct). Initialize it by essentially using a &lt;code&gt;default(T)&lt;/code&gt; expression.&lt;/li&gt;
&lt;li&gt;The type has a parameterless constructor. Initialize it by calling &lt;code&gt;new T()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The type does not have a parameterless constructor. In this case we ask the framework for help and we call &lt;code&gt;FormatterServices.GetUninitializedObject(type)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;// Construct the result.
var constructorInfo = type.GetConstructor(Type.EmptyTypes);
if (type.IsValueType)
{
    // Value types can be initialized directly.
    // C#: result = default(T);
    il.Emit(OpCodes.Ldloca_S, (byte)0);
    il.Emit(OpCodes.Initobj, type);
}
else if (constructorInfo != null)
{
    // If a default constructor exists, use that.
    // C#: result = new T();
    il.Emit(OpCodes.Newobj, constructorInfo);
    il.Emit(OpCodes.Stloc_0);
}
else
{
    // If no default constructor exists, create an instance using GetUninitializedObject
    // C#: result = (T)FormatterServices.GetUninitializedObject(type);
    var field = this.fieldBuilder.GetOrCreateStaticField(type);
    il.Emit(OpCodes.Ldsfld, field);
    il.Emit(OpCodes.Call, this.methodInfos.GetUninitializedObject);
    il.Emit(OpCodes.Castclass, type);
    il.Emit(OpCodes.Stloc_0);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Interlude - What IL Should We Emit?&lt;/h3&gt;
&lt;p&gt;Even if you&apos;re not a first-timer with IL, it&apos;s not always easy to work out what IL you need to emit to achieve the desired result. This is where tools come in to help you. Personally I typically write my code in C# first, slap it into &lt;a href=&quot;https://www.linqpad.net/&quot;&gt;LINQPad&lt;/a&gt;, hit run and open the IL tab in the output. It&apos;s great for experimenting.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/linqpad-il.png&quot; alt=&quot;LINQPad is seriously handy!&quot; title=&quot;LINQPad makes quick experiments with generated IL easy to inspect&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Another option is to use a decompiler/disassembler like &lt;a href=&quot;https://www.jetbrains.com/decompiler/&quot;&gt;JetBrains&apos; dotPeek&lt;/a&gt;. You would compile your assembly and open it in dotPeek to reveal the IL.&lt;/p&gt;
&lt;p&gt;Finally, if you&apos;re like me, then &lt;a href=&quot;https://www.jetbrains.com/resharper/&quot;&gt;ReSharper&lt;/a&gt; is indispensable. It&apos;s like coding on rails (train tracks, not Ruby). ReSharper comes with a convenient &lt;a href=&quot;https://www.jetbrains.com/help/resharper/Viewing_Intermediate_Language.html&quot;&gt;IL Viewer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/resharper-il.png&quot; alt=&quot;ReSharper IL Viewer&quot; title=&quot;ReSharper helps inspect the IL produced by a compiled assembly&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Alright, so that&apos;s how you work out what IL to generate. You&apos;ll occasionally want to &lt;a href=&quot;https://msdn.microsoft.com/en-us/library/system.reflection.emit.opcodes(v=vs.110).aspx&quot;&gt;visit the docs&lt;/a&gt;, too.&lt;/p&gt;
&lt;h3&gt;Back To Emit&lt;/h3&gt;
&lt;p&gt;Now we have a new instance of the input type stored in our local result variable. Before we do anything else, we must record the newly created reference. We push each argument onto the stack in order and use the non-virtual &lt;code&gt;Call&lt;/code&gt; op-code to invoke &lt;code&gt;context.RecordObject(original, result)&lt;/code&gt;. We can use the non-virtual &lt;code&gt;Call&lt;/code&gt; op-code to call &lt;code&gt;CopyContext.RecordObject&lt;/code&gt; because &lt;code&gt;CopyContext&lt;/code&gt; is a &lt;code&gt;sealed&lt;/code&gt; class. If it wasn&apos;t, we would use &lt;code&gt;Callvirt&lt;/code&gt; instead.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// An instance of a value type can never appear multiple times in an object graph,
// so only record reference types in the context.
if (!type.IsValueType)
{
    // Record the object.
    // C#: context.RecordObject(original, result);
    il.Emit(OpCodes.Ldarg_1); // context
    il.Emit(OpCodes.Ldarg_0); // original
    il.Emit(OpCodes.Ldloc_0); // result, i.e., the copy of original
    il.Emit(OpCodes.Call, this.methodInfos.RecordObject);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On to the meat of our generator! With the accounting out of the way, we can enumerate over each field and generate code to copy each one into our &lt;code&gt;result&lt;/code&gt; variable. The comments narrate the process:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Copy each field.
foreach (var field in this.copyPolicy.GetCopyableFields(type))
{
    // Load a reference to the result.
    if (type.IsValueType)
    {
        // Value types need to be loaded by address rather than copied onto the stack.
        il.Emit(OpCodes.Ldloca_S, (byte)0);
    }
    else
    {
        il.Emit(OpCodes.Ldloc_0);
    }

    // Load the field value from the original.
    il.Emit(OpCodes.Ldarg_0);
    il.Emit(OpCodes.Ldfld, field);

    // Deep-copy the field if needed, otherwise just leave it as-is.
    if (!this.copyPolicy.IsShallowCopyable(field.FieldType))
    {
        // Copy the field using the generic Copy&amp;lt;T&amp;gt; method.
        // C#: Copy&amp;lt;T&amp;gt;(field)
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Call, this.methodInfos.CopyInner.MakeGenericMethod(field.FieldType));
    }

    // Store the copy of the field on the result.
    il.Emit(OpCodes.Stfld, field);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Return the result and build our delegate using &lt;code&gt;CreateDelegate&lt;/code&gt; so that we can start using it immediately.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// C#: return result;
il.Emit(OpCodes.Ldloc_0);
il.Emit(OpCodes.Ret);
return dynamicMethod.CreateDelegate(typeof(DeepCopyDelegate&amp;lt;T&amp;gt;)) as DeepCopyDelegate&amp;lt;T&amp;gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&apos;s the guts of the library. Of course many details were left out, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Caching &lt;code&gt;Type&lt;/code&gt; values in static fields so that we can reference them from our generated code. See &lt;a href=&quot;https://github.com/ReubenBond/DeepCopy/blob/1b00515b6b6aece93b4bea61bf40780265c2e349/src/DeepCopy/StaticFieldBuilder.cs#L64&quot;&gt;&lt;code&gt;StaticFieldBuilder.cs&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The special handling of arrays in &lt;a href=&quot;https://github.com/ReubenBond/DeepCopy/blob/1b00515b6b6aece93b4bea61bf40780265c2e349/src/DeepCopy/DeepCopier.cs#L69&quot;&gt;&lt;code&gt;DeepCopier.cs&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Optimizations such as using &lt;a href=&quot;https://github.com/ReubenBond/DeepCopy/blob/master/src/DeepCopy/CachedReadConcurrentDictionary.cs&quot;&gt;&lt;code&gt;CachedReadConcurrentDictionary&amp;lt;TKey, TValue&amp;gt;&lt;/code&gt;&lt;/a&gt; for a slight improvement over &lt;code&gt;ConcurrentDictionary&amp;lt;TKey, TValue&amp;gt;&lt;/code&gt; for workloads with a diminishing write volume.&lt;/li&gt;
&lt;/ul&gt;
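&lt;p&gt;To tie the pieces together, here is a condensed, self-contained sketch of the same technique. It is not the library&apos;s code: it only performs a shallow copy, skips reference tracking and private fields declared on base classes, always allocates via &lt;code&gt;GetUninitializedObject&lt;/code&gt;, and loads the &lt;code&gt;Type&lt;/code&gt; with &lt;code&gt;ldtoken&lt;/code&gt; instead of a cached static field. &lt;code&gt;Point&lt;/code&gt; and &lt;code&gt;ShallowCopierSketch&lt;/code&gt; are hypothetical names.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Reflection;
using System.Reflection.Emit;
using System.Runtime.Serialization;

// Hypothetical type: no parameterless constructor, only private readonly fields.
public sealed class Point
{
    private readonly int x;
    private readonly int y;

    public Point(int x, int y)
    {
        this.x = x;
        this.y = y;
    }

    public override string ToString() =&amp;gt; $&quot;({x}, {y})&quot;;
}

public static class ShallowCopierSketch
{
    public static Func&amp;lt;T, T&amp;gt; Create&amp;lt;T&amp;gt;() where T : class
    {
        var type = typeof(T);
        var method = new DynamicMethod(
            type.Name + &quot;ShallowCopier&quot;,
            typeof(T),
            new[] { typeof(T) },
            typeof(ShallowCopierSketch).Module,
            true);
        var il = method.GetILGenerator();
        il.DeclareLocal(type);

        // C#: result = (T)FormatterServices.GetUninitializedObject(typeof(T));
        il.Emit(OpCodes.Ldtoken, type);
        il.Emit(OpCodes.Call, typeof(Type).GetMethod(nameof(Type.GetTypeFromHandle)));
        il.Emit(OpCodes.Call, typeof(FormatterServices).GetMethod(nameof(FormatterServices.GetUninitializedObject)));
        il.Emit(OpCodes.Castclass, type);
        il.Emit(OpCodes.Stloc_0);

        // C#: result.field = original.field; for each field, private and readonly included.
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;
        foreach (var field in type.GetFields(flags))
        {
            il.Emit(OpCodes.Ldloc_0);      // result
            il.Emit(OpCodes.Ldarg_0);      // original
            il.Emit(OpCodes.Ldfld, field); // original.field
            il.Emit(OpCodes.Stfld, field); // result.field = ...
        }

        // C#: return result;
        il.Emit(OpCodes.Ldloc_0);
        il.Emit(OpCodes.Ret);
        return (Func&amp;lt;T, T&amp;gt;)method.CreateDelegate(typeof(Func&amp;lt;T, T&amp;gt;));
    }

    public static void Main()
    {
        var copy = Create&amp;lt;Point&amp;gt;();
        var original = new Point(3, 4);
        var clone = copy(original);
        Console.WriteLine(clone);
        Console.WriteLine(ReferenceEquals(original, clone));
    }
}
&lt;/code&gt;&lt;/pre&gt;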
</content:encoded><author>Reuben Bond</author></item><item><title>Performance Tuning for .NET Core</title><link>https://reubenbond.github.io/posts/dotnet-perf-tuning</link><guid isPermaLink="true">https://reubenbond.github.io/posts/dotnet-perf-tuning</guid><description>Some of you may know I&apos;ve been spending whatever time I can scrounge together grinding away at a new serialization library for .NET. Serializers can be complicated beasts. They have to be reliable, flexible, and fast beyond reproach. I won&apos;t convince you that serialization libraries have to be quick — in this post, that&apos;s a given. These are some tips from my experience in optimizing Hagar&apos;s performance. Most of this advice is applicable to other types of libraries or applications.</description><pubDate>Tue, 15 Jan 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Some of you may know I&apos;ve been spending whatever time I can scrounge together grinding away at a new serialization library for .NET.
Serializers can be complicated beasts. They have to be reliable, flexible, and fast beyond reproach.
I won&apos;t convince you that serialization libraries have to be quick — in this post, that&apos;s a given. These are some tips from my experience in optimizing &lt;a href=&quot;https://github.com/ReubenBond/Hagar&quot;&gt;Hagar&lt;/a&gt;&apos;s performance. &lt;strong&gt;Most of this advice is applicable to other types of libraries or applications.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A post on performance should have minimal overhead and get straight to the point, so this post focuses on tips to help you and things to look out for. &lt;a href=&quot;https://twitter.com/reubenbond&quot;&gt;Message me on Twitter&lt;/a&gt; if something is unclear or you have something to add.&lt;/p&gt;
&lt;p&gt;:::note
This post is intentionally heuristic-heavy: each tip is aimed at hot paths where nanoseconds and allocations add up under load.
:::&lt;/p&gt;
&lt;h2&gt;Maximize profitable inlining&lt;/h2&gt;
&lt;p&gt;Inlining is the technique where a method body is copied to the call site so that we can avoid the cost of jumping, argument passing, and register saving/restoring. In addition to saving those costs, inlining is a requirement for other optimizations. Roslyn (C#&apos;s compiler) does not inline code. Instead, it is the responsibility of the JIT, as are most optimizations.&lt;/p&gt;
&lt;p&gt;:::tip
Inlining is rarely about saving the call itself. The bigger win is that once a method is inlined, the JIT can see more of the surrounding code and unlock other optimizations.
:::&lt;/p&gt;
&lt;h3&gt;Use static &lt;em&gt;throw helpers&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;A recent change which involved a significant refactor added around 20ns to the call duration for the serialization benchmark, increasing times from ~130ns to ~150ns, a regression of roughly 15%.&lt;/p&gt;
&lt;p&gt;The culprit was the &lt;code&gt;throw&lt;/code&gt; statement added in this helper method:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public static Writer&amp;lt;TBufferWriter&amp;gt; CreateWriter&amp;lt;TBufferWriter&amp;gt;(
    this TBufferWriter buffer,
    SerializerSession session) where TBufferWriter : IBufferWriter&amp;lt;byte&amp;gt;
{
    if (session == null) throw new ArgumentNullException(nameof(session));
    return new Writer&amp;lt;TBufferWriter&amp;gt;(buffer, session);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When a method contains a &lt;code&gt;throw&lt;/code&gt; statement, the JIT will not inline it. The common trick to solve this is to add a static &quot;throw helper&quot; method to do the dirty work for you, so the end result looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public static Writer&amp;lt;TBufferWriter&amp;gt; CreateWriter&amp;lt;TBufferWriter&amp;gt;(
    this TBufferWriter buffer,
    SerializerSession session) where TBufferWriter : IBufferWriter&amp;lt;byte&amp;gt;
{
    if (session == null) ThrowSessionNull();
    return new Writer&amp;lt;TBufferWriter&amp;gt;(buffer, session);

    void ThrowSessionNull() =&amp;gt; throw new ArgumentNullException(nameof(session));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Crisis averted. The codebase uses this trick in many places. Having the &lt;code&gt;throw&lt;/code&gt; statement in a separate method may have other benefits such as improving the locality of your commonly used code paths, but I&apos;m unsure and haven&apos;t measured the impact.&lt;/p&gt;
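&lt;p&gt;The helper above is a local function. The same idea is often packaged as a shared static class so that many call sites can reuse it. Here is a minimal sketch of that variant; it is not Hagar&apos;s code, and &lt;code&gt;ThrowHelper&lt;/code&gt; and &lt;code&gt;Guard&lt;/code&gt; are hypothetical names chosen for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Runtime.CompilerServices;

// Hypothetical shared helper; the names are illustrative only.
internal static class ThrowHelper
{
    // NoInlining keeps the cold, throwing code out of the callers&apos; bodies.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void ThrowArgumentNull(string paramName) =&amp;gt;
        throw new ArgumentNullException(paramName);
}

public static class Guard
{
    // The hot method contains a null check and a call, but no throw statement.
    public static T NotNull&amp;lt;T&amp;gt;(T value, string paramName) where T : class
    {
        if (value == null) ThrowHelper.ThrowArgumentNull(paramName);
        return value;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Whether this actually helps depends on the JIT version and the call site, so measure before and after.&lt;/p&gt;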
&lt;h3&gt;Minimize virtual/interface calls&lt;/h3&gt;
&lt;p&gt;Virtual calls are slower than direct calls. If you&apos;re writing a performance critical system then there&apos;s a good chance you&apos;ll see virtual call overhead show up in the profiler. For one, virtual calls require indirection.&lt;/p&gt;
&lt;p&gt;Devirtualization is a feature of many JIT Compilers, and RyuJIT is no exception. It&apos;s a complicated feature, though, and there are not many cases where RyuJIT can currently &lt;em&gt;prove&lt;/em&gt; (to itself) that a method can be devirtualized and therefore become a candidate for inlining. Here are a couple of general tips for taking advantage of devirtualization, but I&apos;m sure there are more (so let me know if you have any).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mark classes as &lt;code&gt;sealed&lt;/code&gt; by default. When a class/method is marked as &lt;code&gt;sealed&lt;/code&gt;, RyuJIT can take that into account and is likely able to inline a method call.&lt;/li&gt;
&lt;li&gt;Mark &lt;code&gt;override&lt;/code&gt; methods as &lt;code&gt;sealed&lt;/code&gt; if possible.&lt;/li&gt;
&lt;li&gt;Use concrete types instead of interfaces. Concrete types give the JIT more information, so it has a better chance of being able to inline your call.&lt;/li&gt;
&lt;li&gt;Instantiate and use non-sealed objects in the same method (rather than having a &apos;create&apos; method). RyuJIT can devirtualize non-sealed method calls when the type is definitely known, such as immediately after construction.&lt;/li&gt;
&lt;li&gt;Use generic type constraints for polymorphic types so that they can be specialized using a concrete type and interface calls can be devirtualized. In Hagar, our core writer type is defined as follows:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;public ref struct Writer&amp;lt;TBufferWriter&amp;gt; where TBufferWriter : IBufferWriter&amp;lt;byte&amp;gt;
{
    private TBufferWriter output;
    // --- etc ---
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All calls to methods on &lt;code&gt;output&lt;/code&gt; in the CIL which Roslyn emits will be preceded by a &lt;code&gt;constrained&lt;/code&gt; instruction which tells the JIT that instead of making a virtual/interface call, the call can be made to the precise method defined on &lt;code&gt;TBufferWriter&lt;/code&gt;. This helps with devirtualization. All calls to methods defined on &lt;code&gt;output&lt;/code&gt; are successfully devirtualized as a result. Here&apos;s &lt;a href=&quot;https://github.com/dotnet/coreclr/issues/9908&quot;&gt;a CoreCLR thread by Andy Ayers&lt;/a&gt; on the JIT team which details current and future work for devirtualization.&lt;/p&gt;
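&lt;p&gt;To make the generic-constraint tip concrete, here is a small sketch comparing an interface-typed parameter with a constrained generic parameter. &lt;code&gt;ICounter&lt;/code&gt;, &lt;code&gt;SimpleCounter&lt;/code&gt;, and &lt;code&gt;CounterRunner&lt;/code&gt; are hypothetical types, not part of Hagar. The JIT compiles a separate specialization for each value-type argument, which is what gives it the chance to call (and possibly inline) the concrete method directly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;

public interface ICounter
{
    void Increment();
    int Count { get; }
}

// A struct implementation: passing it as ICounter boxes it.
public struct SimpleCounter : ICounter
{
    private int count;
    public void Increment() =&amp;gt; this.count++;
    public int Count =&amp;gt; this.count;
}

public static class CounterRunner
{
    // Interface-typed parameter: every call goes through interface dispatch.
    public static int RunViaInterface(ICounter counter, int n)
    {
        for (var i = 0; i &amp;lt; n; i++) counter.Increment();
        return counter.Count;
    }

    // Constrained generic parameter: the concrete type is known per instantiation.
    public static int RunViaConstraint&amp;lt;TCounter&amp;gt;(ref TCounter counter, int n)
        where TCounter : ICounter
    {
        for (var i = 0; i &amp;lt; n; i++) counter.Increment();
        return counter.Count;
    }
}
&lt;/code&gt;&lt;/pre&gt;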
&lt;h2&gt;Reduce allocations&lt;/h2&gt;
&lt;p&gt;.NET&apos;s garbage collector is a remarkable piece of engineering. GC allows for algorithmic optimizations for some lock-free data structures and also removes whole classes of bugs and lightens the developer&apos;s cognitive load. All things considered, garbage collection is a &lt;em&gt;tremendously&lt;/em&gt; successful technique for memory management.&lt;/p&gt;
&lt;p&gt;However, while the GC is a powerful workhorse, it helps to lighten its load, not only because your application will pause for collection less often (and, more generally, less CPU time will be devoted to GC work), but also because a smaller working set is beneficial for cache locality.&lt;/p&gt;
&lt;p&gt;The rule-of-thumb for allocations is that they should either die in the first generation (Gen0) or live forever in the last (Gen2).&lt;/p&gt;
&lt;p&gt;:::important
A useful rule of thumb is that allocations should either die young in Gen0 or live long enough to justify promotion. The awkward middle is where GC overhead tends to hurt.
:::&lt;/p&gt;
&lt;p&gt;.NET uses a bump allocator where each thread allocates objects from its per-thread context by &apos;bumping&apos; a pointer. For this reason, better cache locality can be achieved for short-lived allocations when they are allocated and used on the same thread.&lt;/p&gt;
&lt;p&gt;For more info on .NET&apos;s GC, see &lt;a href=&quot;https://twitter.com/matthewwarren&quot;&gt;Matt Warren&lt;/a&gt;&apos;s blog post series, &lt;a href=&quot;http://mattwarren.org/2016/02/04/learning-how-garbage-collectors-work-part-1/&quot;&gt;&lt;em&gt;Learning How Garbage Collectors Work&lt;/em&gt;&lt;/a&gt;, and pre-order &lt;a href=&quot;https://twitter.com/konradkokosa&quot;&gt;Konrad Kokosa&lt;/a&gt;&apos;s book, &lt;a href=&quot;https://prodotnetmemory.com/&quot;&gt;&lt;em&gt;Pro .NET Memory Management&lt;/em&gt;&lt;/a&gt;. Also check out his fantastic free &lt;a href=&quot;https://prodotnetmemory.com/data/netmemoryposter.pdf&quot;&gt;.NET memory management poster&lt;/a&gt;; it&apos;s a great reference.&lt;/p&gt;
&lt;h3&gt;Pool buffers/objects&lt;/h3&gt;
&lt;p&gt;Hagar itself doesn&apos;t manage buffers but instead defers the responsibility to the user. This might sound onerous but it&apos;s not, since it&apos;s compatible with &lt;a href=&quot;https://blogs.msdn.microsoft.com/dotnet/2018/07/09/system-io-pipelines-high-performance-io-in-net/&quot;&gt;&lt;code&gt;System.IO.Pipelines&lt;/code&gt;&lt;/a&gt;. Therefore, we can take advantage of the high performance buffer pooling which the default &lt;code&gt;Pipe&lt;/code&gt; provides by means of &lt;code&gt;System.Buffers.ArrayPool&amp;lt;T&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Generally speaking, reusing buffers lets you put much less pressure on the GC - your users will be thankful. Don&apos;t write your own buffer pool, unless you truly need to, though - those times have passed.&lt;/p&gt;
&lt;p&gt;:::caution
Reach for &lt;code&gt;ArrayPool&amp;lt;T&amp;gt;&lt;/code&gt; or &lt;code&gt;System.IO.Pipelines&lt;/code&gt; before building your own pool. Custom pooling code is easy to get subtly wrong and hard to benchmark honestly.
:::&lt;/p&gt;
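&lt;p&gt;For reference, renting from the shared pool looks roughly like this. &lt;code&gt;PooledCopy&lt;/code&gt; and &lt;code&gt;SumViaPooledBuffer&lt;/code&gt; are hypothetical names; the method only exists to show the rent/return pattern and is not taken from Hagar.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Buffers;

public static class PooledCopy
{
    // Hypothetical helper: stages the input in a rented buffer, then sums it.
    public static int SumViaPooledBuffer(ReadOnlySpan&amp;lt;byte&amp;gt; input)
    {
        // Rent may return an array larger than requested, so track the length yourself.
        byte[] rented = ArrayPool&amp;lt;byte&amp;gt;.Shared.Rent(input.Length);
        try
        {
            input.CopyTo(rented);
            var sum = 0;
            for (var i = 0; i &amp;lt; input.Length; i++) sum += rented[i];
            return sum;
        }
        finally
        {
            // Always return the buffer, and do not touch it afterwards.
            ArrayPool&amp;lt;byte&amp;gt;.Shared.Return(rented);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;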
&lt;h3&gt;Avoid boxing&lt;/h3&gt;
&lt;p&gt;Wherever possible, do not box value types by casting them to a reference type. This is common advice, but it requires some consideration in your API design. In Hagar, interface and method definitions which might accept value types are made generic so that they can be specialized to the precise type and avoid boxing/unboxing costs. As a result, there is no hot-path boxing. Boxing is still present in some cases, such as string formatting for exception methods. Those particular boxing allocations can be removed by explicit &lt;code&gt;.ToString()&lt;/code&gt; calls on the arguments.&lt;/p&gt;
&lt;p&gt;:::warning
Boxing on a hot path is easy to miss because the code still looks clean. Generic APIs often pay for themselves here by letting the JIT specialize away the allocation.
:::&lt;/p&gt;
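&lt;p&gt;As a small illustration of the API-design point, compare a non-generic method with a generic one. &lt;code&gt;MaxHelpers&lt;/code&gt; is a hypothetical class written for this post, not something from Hagar.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;

public static class MaxHelpers
{
    // Non-generic: value-type arguments are boxed when converted to IComparable.
    public static object MaxBoxed(IComparable a, IComparable b) =&amp;gt;
        a.CompareTo(b) &amp;gt;= 0 ? a : b;

    // Generic: the method is specialized per value type, so no boxing is required.
    public static T Max&amp;lt;T&amp;gt;(T a, T b) where T : IComparable&amp;lt;T&amp;gt; =&amp;gt;
        a.CompareTo(b) &amp;gt;= 0 ? a : b;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Calling &lt;code&gt;MaxHelpers.MaxBoxed(3, 5)&lt;/code&gt; allocates two boxes and returns a boxed result, whereas &lt;code&gt;MaxHelpers.Max(3, 5)&lt;/code&gt; works on plain &lt;code&gt;int&lt;/code&gt; values.&lt;/p&gt;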
&lt;h3&gt;Reduce closure allocations&lt;/h3&gt;
&lt;p&gt;Allocate closures only once and store the result for repeated use. For example, it&apos;s common to pass a delegate to &lt;code&gt;ConcurrentDictionary&amp;lt;K, V&amp;gt;.GetOrAdd&lt;/code&gt;. Instead of writing the delegate as an inline lambda, define it as a private field on the class. Here&apos;s an example from the optional &lt;code&gt;ISerializable&lt;/code&gt; support package in Hagar:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;private readonly Func&amp;lt;Type, Action&amp;lt;object, SerializationInfo, StreamingContext&amp;gt;&amp;gt; createConstructorDelegate;

public ObjectSerializer(SerializationConstructorFactory constructorFactory)
{
    // Other parameters/statements omitted.
    this.createConstructorDelegate = constructorFactory.GetSerializationConstructorDelegate;
}

// Later, on a hot code path:
var constructor = this.constructors.GetOrAdd(info.ObjectType, this.createConstructorDelegate);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Minimize copying&lt;/h2&gt;
&lt;p&gt;.NET Core 2.0 and 2.1 and recent C# versions have made considerable strides in allowing library developers to eliminate data copying. The most notable addition is &lt;code&gt;Span&amp;lt;T&amp;gt;&lt;/code&gt;, but it&apos;s also worth mentioning &lt;code&gt;in&lt;/code&gt; parameter modifiers and &lt;code&gt;readonly struct&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Use &lt;code&gt;Span&amp;lt;T&amp;gt;&lt;/code&gt; to avoid array allocations and avoid data copying&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;Span&amp;lt;T&amp;gt;&lt;/code&gt; and friends are a gigantic performance win for .NET, particularly .NET Core where they use an optimized representation to reduce their size, which required adding GC support for interior pointers. Interior pointers are managed references which point to within the bounds of an array, as opposed to only being able to point to the first element and therefore requiring an additional field containing an offset into the array. For more info on &lt;code&gt;Span&amp;lt;T&amp;gt;&lt;/code&gt; and friends, read Stephen Toub&apos;s article, &lt;a href=&quot;https://msdn.microsoft.com/en-us/magazine/mt814808.aspx&quot;&gt;&lt;em&gt;All About Span: Exploring a New .NET Mainstay&lt;/em&gt; here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Hagar makes extensive use of &lt;code&gt;Span&amp;lt;T&amp;gt;&lt;/code&gt; because it allows us to cheaply create views over small sections of larger buffers to work with. Enough has been written on the subject that there&apos;s no use me writing more here.&lt;/p&gt;
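&lt;p&gt;Still, a tiny sketch shows what &quot;views over small sections of larger buffers&quot; means in practice. The framing used here (a 4-byte little-endian length followed by the payload) is an assumption made for this example and is not Hagar&apos;s wire format; &lt;code&gt;SpanViews&lt;/code&gt; is a hypothetical class.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Buffers.Binary;

public static class SpanViews
{
    // Writes [length: 4 bytes, little-endian][payload] and returns the bytes written.
    public static int WriteFrame(Span&amp;lt;byte&amp;gt; destination, ReadOnlySpan&amp;lt;byte&amp;gt; payload)
    {
        // Slice creates a view over the same memory; it does not allocate.
        BinaryPrimitives.WriteInt32LittleEndian(destination.Slice(0, 4), payload.Length);
        payload.CopyTo(destination.Slice(4));
        return 4 + payload.Length;
    }

    // Returns a view of the payload inside the source buffer without copying it.
    public static ReadOnlySpan&amp;lt;byte&amp;gt; ReadFrame(ReadOnlySpan&amp;lt;byte&amp;gt; source)
    {
        var length = BinaryPrimitives.ReadInt32LittleEndian(source);
        return source.Slice(4, length);
    }
}
&lt;/code&gt;&lt;/pre&gt;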
&lt;h3&gt;Pass structs by &lt;code&gt;ref&lt;/code&gt; to minimize on-stack copies&lt;/h3&gt;
&lt;p&gt;Hagar uses two main structs, &lt;code&gt;Reader&lt;/code&gt; &amp;amp; &lt;code&gt;Writer&amp;lt;TBufferWriter&amp;gt;&lt;/code&gt;. These structs each contain several fields and are passed to almost every call along the serialization/deserialization call path.&lt;/p&gt;
&lt;p&gt;Without intervention, each method call made with these structs would carry significant weight since the entire struct would need to be copied onto the stack for every call, not to mention any mutations would need to be copied back to the caller.&lt;/p&gt;
&lt;p&gt;We can avoid that cost by passing these structs as &lt;code&gt;ref&lt;/code&gt; parameters. C# also supports using &lt;code&gt;ref this&lt;/code&gt; as the target for an extension method, which is very convenient. As far as I know, there&apos;s no way to ensure that a particular struct type is always passed by ref and this can lead to subtle bugs if you accidentally omit &lt;code&gt;ref&lt;/code&gt; in the parameter list of a call, since the struct will be silently copied and modifications made by a method (eg, advancing a write pointer) will be lost.&lt;/p&gt;
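&lt;p&gt;Here is a minimal sketch of both the convenient &lt;code&gt;ref this&lt;/code&gt; form and the silent-copy bug. &lt;code&gt;Cursor&lt;/code&gt; and &lt;code&gt;CursorExtensions&lt;/code&gt; are hypothetical stand-ins for a reader or writer struct.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public struct Cursor
{
    public int Position;
}

public static class CursorExtensions
{
    // &apos;ref this&apos; passes the struct by reference, so the caller sees the mutation.
    public static void Advance(ref this Cursor cursor, int count) =&amp;gt; cursor.Position += count;

    // By-value parameter: the method mutates its own copy and the change is lost.
    public static void AdvanceCopy(Cursor cursor, int count) =&amp;gt; cursor.Position += count;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After &lt;code&gt;cursor.Advance(3)&lt;/code&gt; the caller&apos;s &lt;code&gt;Position&lt;/code&gt; is 3, while &lt;code&gt;CursorExtensions.AdvanceCopy(cursor, 3)&lt;/code&gt; compiles without complaint and leaves it unchanged.&lt;/p&gt;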
&lt;h3&gt;Avoid defensive copies&lt;/h3&gt;
&lt;p&gt;Roslyn sometimes has to do extra work to guarantee language invariants. When a &lt;code&gt;struct&lt;/code&gt; is stored in a &lt;code&gt;readonly&lt;/code&gt; field, the compiler will insert instructions to &lt;em&gt;defensively copy&lt;/em&gt; that field before involving it in any operation which isn&apos;t guaranteed to &lt;em&gt;not&lt;/em&gt; mutate it. Typically this means calls to methods defined on the struct type itself, because passing a struct as an argument to a method defined on another type already requires copying the struct onto the stack (unless it&apos;s passed by &lt;code&gt;ref&lt;/code&gt; or &lt;code&gt;in&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;This defensive copy can be avoided if the struct is defined as a &lt;code&gt;readonly struct&lt;/code&gt;, which is a C# 7.2 language feature, enabled by adding &lt;code&gt;&amp;lt;LangVersion&amp;gt;7.2&amp;lt;/LangVersion&amp;gt;&lt;/code&gt; to your csproj file.&lt;/p&gt;
&lt;p&gt;Sometimes it&apos;s better to omit the &lt;code&gt;readonly&lt;/code&gt; modifier on an otherwise immutable struct field if you are unable to define it as a &lt;code&gt;readonly struct&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;See Jon Skeet&apos;s NodaTime library as an example. In &lt;a href=&quot;https://github.com/nodatime/nodatime/pull/1130&quot;&gt;this PR&lt;/a&gt;, Jon made most structs &lt;code&gt;readonly&lt;/code&gt; and was therefore able to add the &lt;code&gt;readonly&lt;/code&gt; modifier to fields holding those structs without negatively impacting performance.&lt;/p&gt;
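&lt;p&gt;A minimal sketch of the pattern, using hypothetical &lt;code&gt;Point&lt;/code&gt; and &lt;code&gt;Shape&lt;/code&gt; types: because &lt;code&gt;Point&lt;/code&gt; is declared as a &lt;code&gt;readonly struct&lt;/code&gt;, the compiler knows its methods cannot mutate it and does not need a defensive copy when they are called on a &lt;code&gt;readonly&lt;/code&gt; field.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public readonly struct Point
{
    public Point(int x, int y) { X = x; Y = y; }
    public int X { get; }
    public int Y { get; }
    public int LengthSquared() =&amp;gt; X * X + Y * Y;
}

public sealed class Shape
{
    // A readonly field holding a readonly struct: no defensive copy is required.
    private readonly Point origin;

    public Shape(Point origin) =&amp;gt; this.origin = origin;

    public int OriginLengthSquared() =&amp;gt; this.origin.LengthSquared();
}
&lt;/code&gt;&lt;/pre&gt;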
&lt;h2&gt;Reduce branching &amp;amp; branch misprediction&lt;/h2&gt;
&lt;p&gt;Modern CPUs rely on having long pipelines of instructions which are processed with some concurrency. This involves the CPU analyzing instructions to determine which ones aren&apos;t reliant on previous instructions and also involves guessing which conditional jump statements are going to be taken. In order to do this, the CPU uses a component called the branch predictor which is responsible for guessing which branch will be taken. It typically does this by reading &amp;amp; writing entries in a table, revising its prediction based upon what happened last time the conditional jump was executed.&lt;/p&gt;
&lt;p&gt;When it guesses correctly, this prediction process provides a substantial speedup. When it mispredicts the branch (jump target), however, it needs to throw out all of the work performed in processing instructions after the branch and re-fill the pipeline with instructions from the correct branch before continuing execution.&lt;/p&gt;
&lt;p&gt;The fastest branch is no branch. First try to minimize the number of branches, always measuring whether or not your alternative is faster. When you cannot eliminate a branch, try to minimize misprediction rates. This may involve &lt;a href=&quot;https://stackoverflow.com/a/11227902/635314&quot;&gt;using sorted data&lt;/a&gt; or restructuring your code.&lt;/p&gt;
&lt;p&gt;One strategy for eliminating a branch is to replace it with a lookup. Sometimes an algorithm can be made branch-free instead of using conditionals. Sometimes &lt;a href=&quot;https://mijailovic.net/2018/06/06/sha256-armv8/&quot;&gt;hardware&lt;/a&gt; &lt;a href=&quot;https://blogs.msdn.microsoft.com/dotnet/2018/10/10/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/&quot;&gt;intrinsics&lt;/a&gt; can be used to eliminate branching.&lt;/p&gt;
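&lt;p&gt;As a toy example of replacing a branch with a lookup, consider converting a 4-bit value to a hex digit. &lt;code&gt;HexDigits&lt;/code&gt; is a hypothetical class. Note that the string indexer still performs a bounds check, so this is a sketch of the idea rather than a guaranteed win; measure it in your own code.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public static class HexDigits
{
    // Branching version: the conditional may compile to a jump.
    public static char ToHexBranching(int nibble) =&amp;gt;
        nibble &amp;lt; 10 ? (char)(&apos;0&apos; + nibble) : (char)(&apos;a&apos; + nibble - 10);

    // Lookup version: index into a constant table. The mask keeps the index in range.
    public static char ToHexLookup(int nibble) =&amp;gt; &quot;0123456789abcdef&quot;[nibble &amp;amp; 0xF];
}
&lt;/code&gt;&lt;/pre&gt;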
&lt;h2&gt;Other miscellaneous tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Avoid LINQ. LINQ is great in application code, but rarely belongs on a hot path in library/framework code. LINQ is difficult for the JIT to optimize (&lt;code&gt;IEnumerable&amp;lt;T&amp;gt;&lt;/code&gt;...) and tends to be allocation-happy.&lt;/li&gt;
&lt;li&gt;Use concrete types instead of interfaces or abstract types. This was mentioned above in the context of inlining, but it has other benefits. Perhaps the most common example: if you are iterating over a &lt;code&gt;List&amp;lt;T&amp;gt;&lt;/code&gt;, it&apos;s best to &lt;em&gt;not&lt;/em&gt; cast that list to &lt;code&gt;IEnumerable&amp;lt;T&amp;gt;&lt;/code&gt; first (eg, by using LINQ or passing it to a method as an &lt;code&gt;IEnumerable&amp;lt;T&amp;gt;&lt;/code&gt; parameter). The reason for this is that enumerating over a list using &lt;code&gt;foreach&lt;/code&gt; uses a non-allocating &lt;code&gt;List&amp;lt;T&amp;gt;.Enumerator&lt;/code&gt; struct, but when it&apos;s cast to &lt;code&gt;IEnumerable&amp;lt;T&amp;gt;&lt;/code&gt;, that struct must be boxed to &lt;code&gt;IEnumerator&amp;lt;T&amp;gt;&lt;/code&gt; for &lt;code&gt;foreach&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Reflection is exceptionally useful in library code, but it &lt;em&gt;will&lt;/em&gt; kill you if you give it the chance. Cache the results of reflection, consider generating delegates for accessors using IL or Roslyn, or better yet, use an existing library such as &lt;a href=&quot;https://github.com/aspnet/Common/blob/ff87989d893b000aac1bfef0157c92be1f04f714/shared/Microsoft.Extensions.ObjectMethodExecutor.Sources/ObjectMethodExecutor.cs&quot;&gt;&lt;code&gt;Microsoft.Extensions.ObjectMethodExecutor.Sources&lt;/code&gt;&lt;/a&gt;, &lt;a href=&quot;https://github.com/aspnet/Common/blob/ff87989d893b000aac1bfef0157c92be1f04f714/shared/Microsoft.Extensions.PropertyHelper.Sources/PropertyHelper.cs&quot;&gt;&lt;code&gt;Microsoft.Extensions.PropertyHelper.Sources&lt;/code&gt;&lt;/a&gt;, or &lt;a href=&quot;https://github.com/mgravell/fast-member&quot;&gt;&lt;code&gt;FastMember&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
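&lt;p&gt;To illustrate the reflection tip, here is one way to pay the reflection cost once per property and then call a cached delegate. This is a sketch under simplifying assumptions: &lt;code&gt;StringPropertyAccessor&lt;/code&gt; is a hypothetical class, it only handles public &lt;code&gt;string&lt;/code&gt; properties on reference types, and it does no error handling. The libraries listed above are more complete.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Collections.Concurrent;
using System.Reflection;

public static class StringPropertyAccessor&amp;lt;T&amp;gt; where T : class
{
    private static readonly ConcurrentDictionary&amp;lt;string, Func&amp;lt;T, string&amp;gt;&amp;gt; Cache =
        new ConcurrentDictionary&amp;lt;string, Func&amp;lt;T, string&amp;gt;&amp;gt;();

    // Stored once so that GetOrAdd does not need a new delegate on every call.
    private static readonly Func&amp;lt;string, Func&amp;lt;T, string&amp;gt;&amp;gt; CreateGetter = Create;

    public static Func&amp;lt;T, string&amp;gt; Get(string propertyName) =&amp;gt;
        Cache.GetOrAdd(propertyName, CreateGetter);

    private static Func&amp;lt;T, string&amp;gt; Create(string propertyName)
    {
        // Reflection runs once per property; later calls go through the delegate.
        MethodInfo getter = typeof(T).GetProperty(propertyName).GetGetMethod();
        return (Func&amp;lt;T, string&amp;gt;)Delegate.CreateDelegate(typeof(Func&amp;lt;T, string&amp;gt;), getter);
    }
}
&lt;/code&gt;&lt;/pre&gt;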
&lt;h2&gt;Library-specific optimizations&lt;/h2&gt;
&lt;h2&gt;Optimize generated code&lt;/h2&gt;
&lt;p&gt;Hagar uses Roslyn to generate C# code for the POCOs you want to serialize, and this C# code is included in your project at compile time. There are some optimizations which we can perform on the generated code to make things faster.&lt;/p&gt;
&lt;h3&gt;Avoid virtual calls by skipping codec lookup for well-known types&lt;/h3&gt;
&lt;p&gt;When complex objects contain well known fields such as &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;Guid&lt;/code&gt;, &lt;code&gt;string&lt;/code&gt;, the code generator will directly insert calls to the hand-coded codecs for those types instead of calling into the &lt;code&gt;CodecProvider&lt;/code&gt; to retrieve an &lt;code&gt;IFieldCodec&amp;lt;T&amp;gt;&lt;/code&gt; instance for that type. This lets the JIT inline those calls and avoids virtual/interface indirection.&lt;/p&gt;
&lt;h3&gt;(Unimplemented) Specialize generic types at runtime&lt;/h3&gt;
&lt;p&gt;Similar to above, the code generator could generate code which uses specialization at runtime.&lt;/p&gt;
&lt;h2&gt;Pre-compute constant values to eliminate some branching&lt;/h2&gt;
&lt;p&gt;During serialization, each field is prefixed with a header – usually a single byte – which tells the deserializer which field was encoded. This field header contains 3 pieces of info: the wire type of the field (fixed-width, length-prefixed, tag-delimited, referenced, etc), the schema type of the field (expected, well-known, previously-defined, encoded), which is used for polymorphism, and the field id, which is encoded in the last 3 bits (if it&apos;s less than 7). In many cases, it&apos;s possible to know exactly what this header byte will be at compile time. If a field has a value type, then we know that the runtime type can never differ from the field type and we always know the field id.&lt;/p&gt;
&lt;p&gt;Therefore, we can often save all of the work required to compute the header value and can directly embed it into code as a constant. This saves branching and generally eliminates a lot of IL code.&lt;/p&gt;
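&lt;p&gt;The sketch below shows the general idea with an assumed bit layout of 3 bits for the wire type, 2 bits for the schema type, and 3 bits for the field id. The enum members, their numeric values, and the &lt;code&gt;FieldHeader&lt;/code&gt; class are all hypothetical and are not Hagar&apos;s actual definitions.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical values, chosen only for this example.
public enum WireType : byte { VarInt = 0, LengthPrefixed = 1, Fixed32 = 2 }
public enum SchemaType : byte { Expected = 0, WellKnown = 1 }

public static class FieldHeader
{
    // Assumed layout: [wire type: 3 bits][schema type: 2 bits][field id: 3 bits].
    public static byte Compute(WireType wireType, SchemaType schemaType, int fieldId) =&amp;gt;
        (byte)(((byte)wireType &amp;lt;&amp;lt; 5) | ((byte)schemaType &amp;lt;&amp;lt; 3) | (fieldId &amp;amp; 0x7));

    // What a code generator could emit for a fixed-width field with id 2 and the
    // expected type: a constant, with no runtime computation or branching.
    public const byte ValueField2Header =
        ((byte)WireType.Fixed32 &amp;lt;&amp;lt; 5) | ((byte)SchemaType.Expected &amp;lt;&amp;lt; 3) | 2;
}
&lt;/code&gt;&lt;/pre&gt;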
&lt;h2&gt;Choose appropriate data structures&lt;/h2&gt;
&lt;p&gt;One of the big performance disadvantages Hagar has when compared to other serializers such as &lt;a href=&quot;https://github.com/mgravell/protobuf-net&quot;&gt;protobuf-net&lt;/a&gt; (in its default configuration?) and &lt;a href=&quot;https://github.com/neuecc/MessagePack-CSharp&quot;&gt;MessagePack-CSharp&lt;/a&gt; is that it supports cyclic graphs and therefore must track objects as they&apos;re serialized so that object cycles are not lost during deserialization. When this was first implemented, the core data structure was a &lt;code&gt;Dictionary&amp;lt;object, int&amp;gt;&lt;/code&gt;. It was clear in initial benchmarking that reference tracking was a dominating cost. In particular, clearing the dictionary between messages was expensive. By switching to an array of structs instead, the cost of indexing and maintaining the collection is largely eliminated and reference tracking no longer appears in the benchmarks. There is a downside to this: for large object graphs it&apos;s likely that this new approach is slower. If that becomes an issue, we can decide to dynamically switch between implementations.&lt;/p&gt;
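&lt;p&gt;To give a feel for the array-of-structs approach, here is a simplified sketch. It is not Hagar&apos;s implementation: &lt;code&gt;ReferenceTracker&lt;/code&gt; and its members are hypothetical, and the linear scan is exactly why this style suits small object graphs better than large ones.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;

public sealed class ReferenceTracker
{
    private struct Entry
    {
        public object Value;
        public int Id;
    }

    private Entry[] entries = new Entry[16];
    private int count;

    // Returns true and the existing id if the object was already seen;
    // otherwise records it under a new id and returns false.
    public bool TryGetOrAdd(object value, out int id)
    {
        for (var i = 0; i &amp;lt; this.count; i++)
        {
            if (ReferenceEquals(this.entries[i].Value, value))
            {
                id = this.entries[i].Id;
                return true;
            }
        }

        if (this.count == this.entries.Length) Array.Resize(ref this.entries, this.count * 2);
        id = this.count + 1;
        this.entries[this.count++] = new Entry { Value = value, Id = id };
        return false;
    }

    // Resetting between messages only clears the entries which were used.
    public void Reset()
    {
        Array.Clear(this.entries, 0, this.count);
        this.count = 0;
    }
}
&lt;/code&gt;&lt;/pre&gt;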
&lt;h2&gt;Choose appropriate algorithms&lt;/h2&gt;
&lt;p&gt;Hagar spends a lot of time encoding/decoding variable-length integers, often referred to as varints, in order to reduce the size of the payload (which can be more compact for storage/transport). Many binary serializers use this technique, including &lt;a href=&quot;https://developers.google.com/protocol-buffers/docs/encoding#varints&quot;&gt;Protocol Buffers&lt;/a&gt;. Even .NET&apos;s BinaryWriter uses this encoding. Here&apos;s a &lt;a href=&quot;https://github.com/Microsoft/referencesource/blob/60a4f8b853f60a424e36c7bf60f9b5b5f1973ed1/mscorlib/system/io/binarywriter.cs#L414&quot;&gt;snippet from the reference source&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;protected void Write7BitEncodedInt(int value) {
    // Write out an int 7 bits at a time.  The high bit of the byte,
    // when on, tells reader to continue reading more bytes.
    uint v = (uint) value;   // support negative numbers
    while (v &amp;gt;= 0x80) {
        Write((byte) (v | 0x80));
        v &amp;gt;&amp;gt;= 7;
    }
    Write((byte)v);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at this source, I want to point out that &lt;a href=&quot;https://developers.google.com/protocol-buffers/docs/encoding#signed-integers&quot;&gt;ZigZag encoding&lt;/a&gt; may be more efficient for signed integers which contain negative values, rather than casting to &lt;code&gt;uint&lt;/code&gt;.&lt;/p&gt;
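&lt;p&gt;For illustration, here is the standard 32-bit ZigZag mapping together with a helper which counts how many bytes a LEB128 varint would take. &lt;code&gt;ZigZag&lt;/code&gt; is a hypothetical class written for this post, not code from Hagar or the reference source.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;public static class ZigZag
{
    // Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... so small magnitudes stay small.
    public static uint Encode(int value) =&amp;gt;
        unchecked((uint)((value &amp;lt;&amp;lt; 1) ^ (value &amp;gt;&amp;gt; 31)));

    // Inverse of Encode.
    public static int Decode(uint value) =&amp;gt;
        (int)(value &amp;gt;&amp;gt; 1) ^ -(int)(value &amp;amp; 1);

    // Number of bytes needed to write the value 7 bits at a time.
    public static int VarIntLength(uint value)
    {
        var length = 1;
        while (value &amp;gt;= 0x80)
        {
            value &amp;gt;&amp;gt;= 7;
            length++;
        }

        return length;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the plain cast, &lt;code&gt;-1&lt;/code&gt; becomes &lt;code&gt;0xFFFFFFFF&lt;/code&gt; and takes 5 bytes. With ZigZag it becomes &lt;code&gt;1&lt;/code&gt; and takes a single byte.&lt;/p&gt;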
&lt;p&gt;VarInts in these serializers use an algorithm called Little Endian Base-128 or LEB128, which encodes up to 7 bits per byte. It uses the most significant bit of each byte to indicate whether or not another byte follows (1 = yes, 0 = no). This is a simple format but it may not be the fastest. It might turn out that PrefixVarint is faster. With PrefixVarint, all of those 1s from LEB128 are written in one shot, at the beginning of the payload. This may let us use &lt;a href=&quot;https://mijailovic.net/2018/06/06/sha256-armv8/&quot;&gt;hardware&lt;/a&gt; &lt;a href=&quot;https://blogs.msdn.microsoft.com/dotnet/2018/10/10/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/&quot;&gt;intrinsics&lt;/a&gt; to improve the speed of this encoding &amp;amp; decoding. By moving the size information to the front, we may also be able to read more bytes at a time from the payload, reducing internal bookkeeping and improving performance. If someone wants to implement this in C#, I will happily take a PR if it turns out to be faster.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Hopefully you&apos;ve found something useful in this post. &lt;a href=&quot;https://twitter.com/reubenbond&quot;&gt;Let me know&lt;/a&gt; if something is unclear or you have something to add. Since I started writing this, I&apos;ve moved to Redmond and officially joined Microsoft on the &lt;a href=&quot;https://github.com/dotnet/orleans&quot;&gt;Orleans&lt;/a&gt; team, working on some very exciting things.&lt;/p&gt;
</content:encoded><author>Reuben Bond</author></item><item><title>CASPaxos</title><link>https://reubenbond.github.io/posts/caspaxos</link><guid isPermaLink="true">https://reubenbond.github.io/posts/caspaxos</guid><description>Linearizable databases without logs</description><pubDate>Tue, 21 Jan 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently I&apos;ve been playing around with a new algorithm known as &lt;a href=&quot;https://arxiv.org/abs/1802.07000&quot;&gt;CASPaxos&lt;/a&gt;. In this post I&apos;m going to talk about the algorithm and its potential benefits for distributed databases, particularly key-value stores.&lt;/p&gt;
&lt;p&gt;Distributed databases must be &lt;strong&gt;reliable&lt;/strong&gt; and &lt;strong&gt;scalable&lt;/strong&gt;. To achieve reliability, DBs replicate data to other servers. To achieve scalability in terms of total storage capacity, DBs must allow the data to be replicated to only a subset of servers - enough to make the data reasonably reliable but not so much that adding a new server does not increase the total storage capacity of the system or make the system unbearably slow. A typical replication factor is 3: each piece of data is stored on 3 servers. Replication is typically implemented using a consensus algorithm. Well-known algorithms in this family that are used for replication are Raft, Multi-Paxos, and ZAB (which is used in ZooKeeper). Those 3 algorithms make servers agree on the ordering of operations in a log. By executing those operations in order, the database engines on each server can create identical replicas of a database. Logs feature very prominently in distributed/reliable systems (Read &lt;em&gt;&lt;a href=&quot;https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying&quot;&gt;The Log: What every software engineer should know about real-time data&apos;s unifying abstraction&lt;/a&gt;&lt;/em&gt; by Jay Kreps).&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1802.07000&quot;&gt;CASPaxos&lt;/a&gt; is a new algorithm in this space and it is significantly simpler than the aforementioned algorithms because it does not use log replication. It is a slight modification of the original Paxos algorithm, which is very simple and typically used as a minimal building block for more complicated algorithms such as Multi-Paxos. Instead of replicating log entries between servers, CASPaxos replicates entire values. Because of this, it is best suited for relatively small values, such as individual entries in a key-value store.&lt;/p&gt;
&lt;p&gt;So why is this interesting? In short: it offers us simplicity &amp;amp; performance. Before getting into its benefits, here&apos;s a &lt;strong&gt;sloppy, inaccurate description of CASPaxos - &lt;a href=&quot;https://arxiv.org/abs/1802.07000&quot;&gt;I recommend you read the paper&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;:::tip
&lt;strong&gt;Why it stands out:&lt;/strong&gt; CASPaxos replaces replicated logs with replicated values. That keeps the core protocol small and makes it a useful mental model before tackling full replicated-log systems.
:::&lt;/p&gt;
&lt;h2&gt;CASPaxos&lt;/h2&gt;
&lt;p&gt;CASPaxos replicates changes to a single register amongst a set of replicas. The register holds a user-defined value which is modified by successive application of some change function (which is a closure). Each of these modifications is protected by a version stamp (ballot number), which helps to ensure that previously committed register values are not clobbered without first being observed by the writer. The protocol facilitates learning previously committed values so that replicas can keep up with one another.&lt;/p&gt;
&lt;p&gt;If you are familiar with Raft, you will know that at its core it replicates a log of values. Conceptually, a log-based replicated state machine folds a fixed function over multiple data (the log entries). By contrast, CASPaxos does not use a fixed function and instead folds varying closures over state, with the resulting state itself being replicated to other replicas.&lt;/p&gt;
&lt;p&gt;To illustrate, the following expansions show the result of applying &lt;code&gt;[e0, e1, e2]&lt;/code&gt; (log entries) in Raft, versus &lt;code&gt;[f0, f1, f2]&lt;/code&gt; (closures) in CASPaxos:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Raft: &lt;code&gt;state = f(e2, f(e1, f(e0, ∅)))&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;CASPaxos: &lt;code&gt;state = f2(f1(f0(∅)))&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
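&lt;p&gt;The two expansions above can be sketched in C# as two folds. This only models how state is computed and says nothing about replication or consensus; &lt;code&gt;FoldModels&lt;/code&gt; is a hypothetical class, and an &lt;code&gt;int&lt;/code&gt; stands in for the state.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

public static class FoldModels
{
    // Log-based model: one fixed function f is folded over the replicated entries.
    public static int ReplayLog(IEnumerable&amp;lt;int&amp;gt; entries, Func&amp;lt;int, int, int&amp;gt; f, int initial) =&amp;gt;
        entries.Aggregate(initial, (state, entry) =&amp;gt; f(entry, state));

    // CASPaxos-style model: each change is its own closure, applied to the state.
    public static int ApplyChanges(IEnumerable&amp;lt;Func&amp;lt;int, int&amp;gt;&amp;gt; changes, int initial) =&amp;gt;
        changes.Aggregate(initial, (state, change) =&amp;gt; change(state));
}
&lt;/code&gt;&lt;/pre&gt;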
&lt;p&gt;Aside from what gets replicated and how the current state of the system is computed, Raft and CASPaxos differ in many other ways. For example, CASPaxos is leaderless, whereas Raft uses a strong leader. CASPaxos does not specify the use of heartbeats (in the core algorithm), whereas Raft does. Many of these differences are present because Raft is a more &lt;em&gt;batteries included&lt;/em&gt; algorithm which covers many of the practical concerns involved in building a replicated database.&lt;/p&gt;
&lt;p&gt;Neither approach is strictly better than the other, but since the CASPaxos approach (replicating state values rather than log entries) was fairly novel to me in the context of distributed consensus, I&apos;d like to explore some of the implications, especially as they might apply to the systems I work on.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1802.07000&quot;&gt;Read the paper&lt;/a&gt; to understand the algorithm in more detail.&lt;/p&gt;
&lt;h2&gt;Simplicity&lt;/h2&gt;
&lt;p&gt;The canonical implementation of CASPaxos by its author &lt;a href=&quot;https://twitter.com/rystsov&quot;&gt;Denis Rystsov (@rystsov)&lt;/a&gt; is &lt;a href=&quot;https://github.com/gryadka/js&quot;&gt;Gryadka&lt;/a&gt;, a key-value store written in JavaScript which sits atop Redis. The core, including the CASPaxos implementation, has less than 500 lines of code. &lt;a href=&quot;https://raft.github.io/&quot;&gt;Raft&lt;/a&gt; was also designed to be a simple and understandable algorithm, but it carries with it the weight of log replication, which brings with it the need for log compaction, which brings with it the need for snapshotting and snapshot transfer. Raft also requires leadership elections because it is built around the concept of a &quot;strong leader&quot;. All writes must be served by the single leader in a Raft system, whereas writes can be served by any replica in a CASPaxos system. CASPaxos is simpler to implement than Raft. The &lt;a href=&quot;https://raft.github.io/raft.pdf&quot;&gt;extended Raft paper&lt;/a&gt; is a great read. &lt;a href=&quot;https://github.com/ongardie/dissertation#readme&quot;&gt;Diego Ongaro&apos;s Ph.D. dissertation&lt;/a&gt; includes an important simplification to the original paper&apos;s membership change algorithm. Let&apos;s be clear here: Raft definitely achieved its goal of understandability and it truly deserves the widespread adoption it&apos;s seen.&lt;/p&gt;
&lt;p&gt;:::important
&lt;strong&gt;What the simplicity buys you:&lt;/strong&gt; if your workload looks like a replicated key-value store, fewer moving parts means less machinery for leader routing, log compaction, and snapshot transfer.
:::&lt;/p&gt;
&lt;h2&gt;Storage Performance&lt;/h2&gt;
&lt;p&gt;To analyse the performance implications of CASPaxos, we need to take a little detour and discuss real-world systems. One great example is &lt;a href=&quot;https://www.cockroachlabs.com/&quot;&gt;CockroachDB&lt;/a&gt;, a distributed SQL database. CockroachDB aims to be &lt;strong&gt;reliable&lt;/strong&gt; and &lt;strong&gt;scalable&lt;/strong&gt;. To achieve this, they partition their data and replicate each piece of data to a subset of the servers in the system using an algorithm they call &lt;a href=&quot;https://www.cockroachlabs.com/blog/scaling-raft/&quot;&gt;MultiRaft&lt;/a&gt;. If they were to use a single Raft consensus group, then adding additional servers would not increase the total capacity of the database. If they use many Raft consensus groups naively, the overhead of each consensus group would have a toll on throughput. For example, Raft requires heartbeat messages while idle to maintain leadership. MultiRaft requires multiplexing each consensus group&apos;s log records on disk for performance. That means that log entries for each group might not live near each other on disk, since they are interspersed with many other groups&apos; records. This may take a toll on recovery performance. The alternative is to store each group&apos;s log in contiguous disk segments, but this reduces write throughput: spinning disks and SSDs both perform better when operating sequentially. The optimizations required to make Raft scale well are tricky largely because of its log-based nature.&lt;/p&gt;
&lt;p&gt;Speaking of storage, let&apos;s talk briefly about storage engines. The storage engine is the database component responsible for reading and writing data in a reliable way. Examples include RocksDB, LMDB, ESENT (used in Exchange &amp;amp; Active Directory), WiredTiger, TokuDB, and InnoDB. Two of the most common data structures for implementing a storage engine are &lt;a href=&quot;https://en.wikipedia.org/wiki/B%2B_tree&quot;&gt;B+ Trees&lt;/a&gt; and more recently, &lt;a href=&quot;https://en.wikipedia.org/wiki/Log-structured_merge-tree&quot;&gt;Log-Structured Merge-Trees&lt;/a&gt; (LSM trees). In order to make B+ Trees reliable (any machine may crash at any time), a &lt;a href=&quot;https://sqlite.org/wal.html&quot;&gt;Write-Ahead Log&lt;/a&gt; (WAL) is used. This log is a file containing a sequential list of the database transactions which are being performed. The storage engine eventually applies these transactions to the database image. During crash recovery, the storage engine reads this file and ensures that all of the committed transactions have been applied. This recovery algorithm is called &lt;a href=&quot;https://en.wikipedia.org/wiki/Algorithms_for_Recovery_and_Isolation_Exploiting_Semantics&quot;&gt;ARIES&lt;/a&gt; and it can be found in many reliable storage engines. So B+ Trees split your data into two parts: a log file and a tree. Log-Structured Merge-Trees also generally adopt a Write-Ahead Log for recovery. Since spinning disks and SSDs perform best with sequential reads &amp;amp; writes, log files are a good fit for high-performance, reliable systems.&lt;/p&gt;
&lt;p&gt;Raft is built around log replication, so it might make sense to integrate with the storage engine so that a single log can be used for both purposes: local durability as well as replication. Unfortunately, the storage engine&apos;s log is generally not visible to the storage engine consumer and is usually considered an implementation detail. This means that Raft implementations which use an off-the-shelf storage engine such as RocksDB must store log records inside the storage engine so that they can be read back later. The result is that each operation needs at least 2 writes (1 on the critical path): one for the log entry and one for the result of applying the log entry once it&apos;s committed (e.g., updating a value in a key-value store). A B+ Tree engine needs 4 writes (1 critical). By contrast, CASPaxos needs just 1 write: updating the value itself. Log-based algorithms have natural write amplification whereas CASPaxos does not.&lt;/p&gt;
&lt;p&gt;By removing the need for logs, CASPaxos can achieve high write throughput with off-the-shelf storage engines.&lt;/p&gt;
&lt;p&gt;:::tip
&lt;strong&gt;Write amplification matters:&lt;/strong&gt; when the storage engine already maintains its own WAL, layering a replicated log on top often means writing both the log entry and the materialized result. CASPaxos avoids that extra replicated-log layer.
:::&lt;/p&gt;
&lt;h2&gt;Coordination&lt;/h2&gt;
&lt;p&gt;Each key in a key-value store based on CASPaxos is completely independent of all other keys. This means that no cross-key coordination is required when serving operations on individual keys. Compare this with Raft or MultiRaft where all operations within a given consensus group are strictly ordered. This ordering requires coordination which has some overhead. It means that a slow operation on one key can more easily impact operations on other keys. The low level of coordination required by CASPaxos supports high-concurrency systems without added complexity.&lt;/p&gt;
&lt;p&gt;Coordination is sometimes required, for example when implementing multi-object transactions. Multi-object transactions can be implemented as a higher layer on top of a key-value store with linearizable keys using &lt;a href=&quot;https://en.wikipedia.org/wiki/Two-phase_commit_protocol&quot;&gt;two-phase commit (2PC)&lt;/a&gt;. This is how we implement &lt;a href=&quot;https://www.microsoft.com/en-us/research/publication/transactions-distributed-actors-cloud-2/&quot;&gt;ACID transactions in Orleans&lt;/a&gt;, supporting any strong consistency key-value store.&lt;/p&gt;
&lt;h2&gt;Challenges&lt;/h2&gt;
&lt;p&gt;So far we&apos;ve talked about ways in which CASPaxos might be more suitable for building a distributed key-value store than Raft or MultiRaft. CASPaxos is a simple algorithm and there are many system design questions which are not addressed by the paper definition. So here are some potential challenges when building a real-world system on CASPaxos, as well as some thoughts on how to solve them.&lt;/p&gt;
&lt;p&gt;:::warning
The paper defines the core protocol, not a full production database. The remaining sections are the practical questions you still need to answer when turning CASPaxos into a complete system.
:::&lt;/p&gt;
&lt;h2&gt;Server Catch-up&lt;/h2&gt;
&lt;p&gt;When adding a new server to the database system, the server needs to be brought up to speed with the existing servers. This requires adding it to the consensus group as well as copying all data for the keys which it will be replicating. The CASPaxos paper describes this process as a part of membership change. However, a similar process is needed to ensure that data is sufficiently reliable. For example, if a server loses network connectivity for a few seconds then it may miss some updates to some rarely updated keys. The CASPaxos algorithm does not discuss how to ensure that all updates are eventually replicated. In Raft, it is the leader&apos;s responsibility to keep followers up to speed. In a system built around CASPaxos, which is leaderless, we will likely need to implement a different solution.&lt;/p&gt;
&lt;h2&gt;Membership Change&lt;/h2&gt;
&lt;p&gt;The membership change algorithm in the paper does not offer safety in all cases and it implies a single administrator in the system. Therefore, it is not suitable for use with automated cluster management systems. The &lt;a href=&quot;https://github.com/ReubenBond/orleans/tree/poc-caspaxos/src/Orleans.MetadataStore&quot;&gt;proof-of-concept CASPaxos implementation&lt;/a&gt; on &lt;a href=&quot;https://dotnet.github.io/orleans/Documentation/Introduction.html&quot;&gt;Orleans&lt;/a&gt; uses a &lt;a href=&quot;https://github.com/ReubenBond/orleans/blob/f617b0ce67079a6b79c80fa3c73540fe24d2db7b/src/Orleans.MetadataStore/Configuration/ConfigurationManager.cs#L138&quot;&gt;different membership change algorithm&lt;/a&gt;. It ought to be suitable for automated systems (such as the &lt;a href=&quot;https://dotnet.github.io/orleans/Documentation/Runtime-Implementation-Details/Cluster-Management.html&quot;&gt;cluster membership algorithm used in Orleans&lt;/a&gt;). I believe the algorithm will be safe once fully implemented, but that has not been demonstrated yet. The key idea is to leverage the consensus mechanism of the protocol for cluster membership change, similar to how Raft and Multi-Paxos commit configuration changes to the log. It uses a special-purpose register to store cluster configuration. Proposers indicate which version of the configuration they are using in all calls to Acceptors, and Acceptors reject requests from Proposers running old configurations. This is similar to Raft&apos;s notion of neutralizing old leaders. Additionally, membership changes are restricted to at most one server at a time, which is a special case of &lt;em&gt;joint consensus&lt;/em&gt;. This is the same restriction that Diego Ongaro specified in &lt;a href=&quot;https://github.com/ongardie/dissertation#readme&quot;&gt;his Ph.D. dissertation&lt;/a&gt; for Raft.
In a sense, this extension turns CASPaxos into a 2-level store with the cluster configuration register at the top and data registers below, so the ballot vector is &lt;code&gt;[configuration ballot, data ballot]&lt;/code&gt;.&lt;/p&gt;
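&lt;p&gt;To make the two-level ballot vector concrete, here is a minimal sketch. It assumes ballots are plain integers and that an acceptor adopts any newer vector it is shown. The names &lt;code&gt;Acceptor&lt;/code&gt; and &lt;code&gt;on_request&lt;/code&gt; are hypothetical and are not taken from the Orleans proof of concept.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Acceptor:
    """Hypothetical acceptor holding a two-level ballot vector."""
    config_ballot: int = 0  # ballot of the cluster configuration register
    data_ballot: int = 0    # highest ballot seen for one data register

    def on_request(self, ballot_vector: tuple[int, int]) -> str:
        config, data = ballot_vector
        # Proposers running an old configuration are rejected outright.
        if self.config_ballot > config:
            return "reject: stale configuration"
        # Tuples compare lexicographically, so the configuration ballot
        # dominates and the data ballot only orders requests within it.
        # Assumption: the acceptor simply adopts the newer vector.
        if (config, data) > (self.config_ballot, self.data_ballot):
            self.config_ballot, self.data_ballot = config, data
            return "ok"
        return "reject: stale data ballot"
```

&lt;p&gt;A real acceptor would also have to learn the new configuration itself, not just its ballot. The sketch only shows the ordering.&lt;/p&gt;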
&lt;h2&gt;Scale-out&lt;/h2&gt;
&lt;p&gt;Adding servers should increase the total storage capacity of the system. CASPaxos specifies only the minimal building block of a key-value store, so scale-out is not discussed in the paper. The Raft paper does not specify this either, which motivated the development of MultiRaft for CockroachDB. The dynamic range-based partitioning scheme used by CockroachDB is a good candidate. Implementing this might involve storing range configurations in registers and extending the membership change modification to include 3 levels, &lt;code&gt;[cluster ballot, range ballot, data ballot]&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Large Values&lt;/h2&gt;
&lt;p&gt;CASPaxos is not suitable for replicating large values because each value is sent over the wire every time it is updated. For a replication factor of 3, the entire value is sent 3 times for every update and 6 times if the proposer cannot take advantage of the &lt;em&gt;distinguished leader&lt;/em&gt; optimization.&lt;/p&gt;
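&lt;p&gt;The arithmetic can be sketched as follows. This rests on an assumption about where the extra transfers come from: without the optimization, each replica returns its current value during &lt;code&gt;prepare&lt;/code&gt;, before the proposer sends the new value during &lt;code&gt;accept&lt;/code&gt;. The function names are hypothetical.&lt;/p&gt;

```python
def value_transfers_per_update(replication_factor: int, leader_optimized: bool) -> int:
    """Count how many times the full value crosses the network for one update."""
    # Accept phase: the proposer sends the new value to every replica.
    transfers = replication_factor
    if not leader_optimized:
        # Assumption: a standalone prepare phase runs first, and every
        # replica replies with its current value.
        transfers += replication_factor
    return transfers


def bytes_per_update(value_size: int, replication_factor: int, leader_optimized: bool) -> int:
    """Total value bytes on the wire for one update."""
    return value_size * value_transfers_per_update(replication_factor, leader_optimized)
```

&lt;p&gt;Under these assumptions, a 1 MB value with a replication factor of 3 costs 3 MB of traffic per update with the optimization and 6 MB without it.&lt;/p&gt;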
&lt;p&gt;This limitation could be alleviated in several ways, or it can be ignored and argued away, leaving users to tackle the problem themselves if they truly need large values.&lt;/p&gt;
&lt;p&gt;:::caution
CASPaxos shines when values are modest in size. If updates routinely move large blobs, the simplicity win can be eroded by network and storage bandwidth costs.
:::&lt;/p&gt;
&lt;p&gt;One way to alleviate it might be to split keys over several registers. Without going into detail, this might involve extending the membership change modification yet again to include 4 levels, at which point it may make sense to generalize it into a &lt;em&gt;ballot vector&lt;/em&gt;, &lt;code&gt;[...parent ballots, register ballot]&lt;/code&gt;. Specifically, &lt;code&gt;[config ballot, range ballot, file ballot, register ballot]&lt;/code&gt;. At this point, the system is structured more like a tree than a flat key-value store.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I hope you&apos;ve enjoyed the post. If you&apos;d like to discuss any aspects of it, for example some glaring inaccuracies, drop me a line via Twitter (&lt;a href=&quot;https://twitter.com/reubenbond&quot;&gt;@ReubenBond&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Distributed Systems is a young field with many exciting areas for research and development.&lt;/p&gt;
</content:encoded><author>Reuben Bond</author></item><item><title>Fast CASPaxos</title><link>https://reubenbond.github.io/posts/fast-caspaxos</link><guid isPermaLink="true">https://reubenbond.github.io/posts/fast-caspaxos</guid><description>Fast Paxos-style fast rounds for a rewritable register</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;:::note
&lt;strong&gt;TL;DR:&lt;/strong&gt; Fast CASPaxos adds leaderless commits to CASPaxos so that any proposer can commit an update to a shared register with a single round-trip to a quorum.
I&apos;ve written about &lt;a href=&quot;https://reubenbond.github.io/posts/caspaxos&quot;&gt;CASPaxos&lt;/a&gt; before.
:::&lt;/p&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Classic Paxos (aka &quot;single-decree Paxos&quot;) lets a group of servers agree on a single, &lt;strong&gt;immutable&lt;/strong&gt; value. It divides its responsibilities into two main roles: &lt;strong&gt;proposers&lt;/strong&gt;, which initiate rounds and choose candidate values, and &lt;strong&gt;acceptors&lt;/strong&gt;, which persist promises and accepted values. In most implementations, these roles are bundled together into every server, but it&apos;s common to talk about them as though they are separate processes and we won&apos;t deviate from the norm in this post.&lt;/p&gt;
&lt;p&gt;CASPaxos extends Paxos into a rewritable register, so proposers can &lt;strong&gt;mutate&lt;/strong&gt; the value over time while maintaining strong consistency guarantees (linearizability) so that no acknowledged write is ever clobbered. Classic Paxos requires two messaging round-trips to decide on a value. That is, it operates in 2 phases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a &lt;code&gt;prepare&lt;/code&gt; phase where a proposer tries to become leader and learns the current state from acceptors, and&lt;/li&gt;
&lt;li&gt;an &lt;code&gt;accept&lt;/code&gt; phase where that proposer asks acceptors to accept a value.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A common Paxos optimization is to have each &lt;code&gt;accept&lt;/code&gt; message piggyback the next &lt;code&gt;prepare&lt;/code&gt; message so the same proposer can update the register via a single &lt;code&gt;accept+prepare&lt;/code&gt; message instead of needing two separate round-trips.&lt;/p&gt;
&lt;p&gt;My contribution is &lt;em&gt;Fast CASPaxos&lt;/em&gt;, which &lt;strong&gt;lets any proposer update the register in a single &lt;code&gt;accept+prepare&lt;/code&gt; request&lt;/strong&gt;, not only the proposer that already holds leadership. It works by leveraging an insight from &lt;a href=&quot;https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2005-112.pdf&quot;&gt;Fast Paxos&lt;/a&gt;. After coming up with Fast CASPaxos and going through all of the work to propagandize myself about it through deterministic simulation testing, TLA+ models, showers, etc, I discovered that key ideas were already discovered by Heidi Howard and presented in her thesis titled &lt;a href=&quot;https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf&quot;&gt;Distributed consensus revised&lt;/a&gt;. Still, Fast CASPaxos brings together those ideas and the core idea from CASPaxos in a way that I believe is novel and it identifies one point along the &lt;a href=&quot;https://en.wikipedia.org/wiki/Pareto_front&quot;&gt;Pareto frontier&lt;/a&gt; of optimal distributed consensus algorithms. An interesting side-point about Fast CASPaxos is that it&apos;s cheap to switch between leadered and leaderless consensus at runtime, so theoretically you could create an adaptive algorithm that lets you decide based on observed conditions (conflict rates, relative latency, available servers, etc).&lt;/p&gt;
&lt;p&gt;The rest of the post covers leadered-vs-leaderless consensus, CASPaxos, and Fast CASPaxos.&lt;/p&gt;
&lt;h2&gt;Leader-based vs leaderless consensus&lt;/h2&gt;
&lt;p&gt;Classic Paxos, Multi-Paxos, Raft, and CASPaxos are all examples of leader-based consensus. In leader-based consensus, before a proposer can commit a value, it must first obtain mutually exclusive (but revocable) rights to do so. Leader election grants this right. In Paxos, the &lt;code&gt;prepare&lt;/code&gt; phase is where a proposer tries to acquire that right and the &lt;code&gt;accept&lt;/code&gt; phase is where it tries to commit a value. If leadership is uncontended then each phase typically takes 1 round-trip (one message back and forth between the proposer and acceptors). Multi-Paxos, Raft, and CASPaxos let the same proposer amortize the cost of becoming a leader by allowing it to continue committing values as long as it is uncontended. The key point is that leadership is about mutual exclusion: only one proposer at a time has the right to commit values. For CASPaxos, the gist of how this works is that every time you commit a value you piggyback leader election for the &lt;em&gt;next&lt;/em&gt; value in the same message: each &lt;code&gt;accept&lt;/code&gt; request has the next &lt;code&gt;prepare&lt;/code&gt; piggy-backed on.&lt;/p&gt;
&lt;p&gt;Leadered and leaderless consensus both have a one-round-trip (1 RTT) fast case, but they optimize for different failure and contention patterns. Which approach is right depends on the scenario; we&apos;ll talk about some cases later.&lt;/p&gt;
&lt;p&gt;Leadered consensus lets only the leader commit in 1 RTT. If another proposer wants to commit, it must first become the leader (1 RTT) and then commit. That&apos;s 2 RTT in total if you were to have a different proposer for each commit. If proposers are trying to commit values concurrently, they can end up contending, possibly indefinitely, trying to squeeze those 2 RTT in before the other proposer deposes them as leader. This is the &lt;em&gt;dueling proposers&lt;/em&gt; problem.&lt;/p&gt;
&lt;p&gt;Leaderless consensus algorithms allow any proposer to commit a value without first obtaining exclusive rights. This is the crux of the Fast Paxos optimization: acceptors are prepared ahead of time for a shared &lt;strong&gt;fast round&lt;/strong&gt; in which proposers can send &lt;code&gt;accept&lt;/code&gt; requests directly to acceptors instead of first performing a &lt;code&gt;prepare&lt;/code&gt; phase. If enough acceptors receive the &lt;strong&gt;same&lt;/strong&gt; value, it commits in a single round-trip. If concurrent proposers propose &lt;strong&gt;conflicting&lt;/strong&gt; values, the protocol falls back to classic recovery (&lt;code&gt;prepare&lt;/code&gt; then &lt;code&gt;accept&lt;/code&gt;). So Fast Paxos is leaderless in fast rounds but leadered during recovery.&lt;/p&gt;
&lt;p&gt;The cost of that leaderless fast path is a larger quorum. A later classic &lt;code&gt;prepare&lt;/code&gt; quorum must still be able to tell which value, if any, could have been committed in the fast round. In a classic ballot there is only one proposer, so the ballot identifies a single candidate value. In a fast ballot many proposers can race, so different acceptors may report different values for the same ballot. Recovery therefore has to group responses by value at the highest fast ballot it sees and choose the unique maximum, or any tied maximum. That is why fast rounds need a supermajority: not because the fast proposer needs extra votes for its own sake, but because later recovery must be unable to reinterpret the round as having decided a different value. Leaderless consensus is attractive in many scenarios, but those larger quorums and the risk of conflicts mean it generally performs worse than leadered consensus under contention, which is why the ability to switch between the two modes is appealing.&lt;/p&gt;
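&lt;p&gt;To put numbers on the larger quorum, here is a small sketch. It assumes the usual Fast Paxos intersection condition with majority classic quorums: any two fast quorums and any one classic quorum must share at least one acceptor. An implementation is free to pick other quorum sizes that satisfy the same condition, and the function names are hypothetical.&lt;/p&gt;

```python
def classic_quorum(n: int) -> int:
    """Smallest majority of n acceptors."""
    return n // 2 + 1


def fast_quorum(n: int) -> int:
    """Smallest fast quorum compatible with majority classic quorums.

    Two fast quorums and one classic quorum always overlap when
    classic + 2 * fast > 2 * n, so we take the smallest such size.
    """
    return (2 * n - classic_quorum(n)) // 2 + 1
```

&lt;p&gt;For five acceptors this gives a classic quorum of 3 and a fast quorum of 4. For three acceptors the fast quorum is all 3.&lt;/p&gt;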
&lt;h3&gt;When each mode fits best&lt;/h3&gt;
&lt;h4&gt;Leadered&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High write rate.&lt;/strong&gt; If one proposer is likely to drive a long sequence of updates, leadered mode amortizes prepare costs across many operations instead of paying them repeatedly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High conflict rate.&lt;/strong&gt; When independent proposers frequently want different next values, a leader reduces repeated proposer-versus-proposer collisions and gives more stable progress than optimistic fast rounds. To take advantage of this, you need to route requests to a stable leader.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Leaderless&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Geo-distributed, low-conflict deployments.&lt;/strong&gt; When network latency is high, leaderless 1 RTT commits are particularly attractive, but the penalty of contention is more pronounced.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Infrequent writes.&lt;/strong&gt; When writes are infrequent and proposers have a chance to learn the latest committed value without going through consensus, fast rounds allow 1 RTT commits at any proposer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Almost-everywhere agreement.&lt;/strong&gt; When concurrent proposers are proposing the same value anyway, shared fast rounds let them proceed without dueling. The &lt;a href=&quot;https://www.usenix.org/conference/atc18/presentation/suresh&quot;&gt;Rapid cluster membership algorithm&lt;/a&gt; uses Fast Paxos for exactly this property: Rapid&apos;s cut detector produces proposals based on observer alerts and delays action until churn stabilizes into the same multi-process cut, resulting in very low conflict rates among proposers, allowing them to commit in 1 RTT most of the time without going through a leader.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Register initialization.&lt;/strong&gt; The initial fast ballot is implicitly prepared, so the first write can skip &lt;code&gt;prepare&lt;/code&gt; entirely and go straight to &lt;code&gt;accept&lt;/code&gt;. That makes 1 RTT initialization especially attractive when a system creates many lightweight registers which may only be written once.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When it&apos;s not clear-cut&lt;/h3&gt;
&lt;p&gt;I&apos;ve seen arguments saying that the smaller quorum requirements of leadered protocols confer advantages like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduced chance that a straggler server will slow down consensus&lt;/li&gt;
&lt;li&gt;Improved failure masking (more servers can fail before progress stops)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those are not unreasonable ideas, but in a leadered protocol, the leader itself can be the straggler and slow down all requests (a gray failure). If the leader crashes you need to first detect that, which is inherently slow, and then elect a new leader. That handoff naturally creates a stutter where progress pauses while the system re-establishes mutual exclusion. The system becomes temporarily unavailable during this period. Leaderless algorithms don&apos;t have these problems. Given that, these attributes are not clear-cut.&lt;/p&gt;
&lt;h2&gt;CASPaxos&lt;/h2&gt;
&lt;p&gt;CASPaxos implements a linearizable rewritable register. Instead of replicating an entire log of commands like Raft and Multi-Paxos do, each operation has a proposer read the current register value, apply an update function locally, and replicate the resulting value to a quorum of acceptors. That makes it a good fit for small-ish blobs of strongly consistent state such as configuration, leases, and membership metadata.&lt;/p&gt;
&lt;p&gt;Every update is protected by a ballot number. Before a proposer can overwrite the register, it has to learn the highest value which might still matter, so previously committed work cannot be clobbered blindly.&lt;/p&gt;
&lt;p&gt;The protocol is still classic Paxos-shaped: first &lt;code&gt;prepare&lt;/code&gt; a ballot, then &lt;code&gt;accept&lt;/code&gt; a value at that ballot. The proposer can piggyback the next ballot&apos;s &lt;code&gt;prepare&lt;/code&gt; onto its successful &lt;code&gt;accept&lt;/code&gt;, letting the same proposer stay on a 1 RTT steady-state path.&lt;/p&gt;
&lt;h3&gt;Role-local state&lt;/h3&gt;
&lt;h4&gt;Proposer state (volatile)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;id&lt;/code&gt;: unique proposer identifier&lt;/li&gt;
&lt;li&gt;&lt;code&gt;round&lt;/code&gt;: monotonically increasing, initially 0&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prepared&lt;/code&gt;: most recently prepared ballot, initially &lt;code&gt;null&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cachedVal&lt;/code&gt;: last decided value, initially &lt;code&gt;null&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Acceptor state (persistent)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;promised&lt;/code&gt;: highest ballot promised&lt;/li&gt;
&lt;li&gt;&lt;code&gt;accepted&lt;/code&gt;: last accepted ballot, or &lt;code&gt;null&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;value&lt;/code&gt;: last accepted value, or &lt;code&gt;null&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Phase 1: prepare&lt;/h3&gt;
&lt;p&gt;The proposer picks a fresh ballot &lt;code&gt;b = (round, id)&lt;/code&gt; and sends &lt;code&gt;Prepare(b)&lt;/code&gt; to acceptors. An acceptor compares &lt;code&gt;b&lt;/code&gt; with its &lt;code&gt;promised&lt;/code&gt;. If &lt;code&gt;b&lt;/code&gt; is high enough, it raises &lt;code&gt;promised&lt;/code&gt; to &lt;code&gt;b&lt;/code&gt; and replies with &lt;code&gt;Promise(accepted, value)&lt;/code&gt;. Otherwise it replies with &lt;code&gt;Reject(promised)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To succeed, the proposer needs a quorum of &lt;code&gt;Promise&lt;/code&gt; responses. If some acceptor reports a higher ballot via &lt;code&gt;Reject(promised)&lt;/code&gt;, the proposer does &lt;strong&gt;not&lt;/strong&gt; get to ignore it and move on. It must move &lt;code&gt;round&lt;/code&gt; above the highest promised ballot it saw and retry.&lt;/p&gt;
&lt;p&gt;Once &lt;code&gt;prepare&lt;/code&gt; succeeds, the proposer recovers the latest value which might matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if no acceptor in the quorum has accepted anything, then &lt;code&gt;cachedVal = ⊥&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;otherwise, the proposer sets &lt;code&gt;cachedVal&lt;/code&gt; to the value paired with the highest &lt;code&gt;accepted&lt;/code&gt; ballot it observed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That recovered &lt;code&gt;cachedVal&lt;/code&gt; is the safe input to the client&apos;s update function.&lt;/p&gt;
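&lt;p&gt;The prepare phase above can be sketched as follows. This is a minimal in-memory illustration, not a definitive implementation: it assumes &quot;high enough&quot; means strictly greater than &lt;code&gt;promised&lt;/code&gt;, models ballots as &lt;code&gt;(round, id)&lt;/code&gt; tuples, and uses the hypothetical helper names &lt;code&gt;on_prepare&lt;/code&gt;, &lt;code&gt;recover&lt;/code&gt;, and &lt;code&gt;next_round&lt;/code&gt;.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

Ballot = tuple[int, int]  # (round, proposer id), compared lexicographically


@dataclass
class Acceptor:
    promised: Ballot = (0, 0)
    accepted: Optional[Ballot] = None
    value: Optional[str] = None

    def on_prepare(self, b: Ballot):
        # Assumption: "high enough" means strictly greater than `promised`.
        if b > self.promised:
            self.promised = b
            return ("Promise", self.accepted, self.value)
        return ("Reject", self.promised)


def recover(promises):
    """Return the value paired with the highest accepted ballot, or None."""
    accepted = [(a, v) for (_, a, v) in promises if a is not None]
    if not accepted:
        return None
    return max(accepted, key=lambda pair: pair[0])[1]


def next_round(rejects) -> int:
    """Pick a round above the highest promised ballot reported in rejections."""
    return max(promised[0] for (_, promised) in rejects) + 1
```

&lt;p&gt;A proposer would call &lt;code&gt;recover&lt;/code&gt; only after collecting a quorum of &lt;code&gt;Promise&lt;/code&gt; replies, and would call &lt;code&gt;next_round&lt;/code&gt; before retrying after any rejection.&lt;/p&gt;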
&lt;h3&gt;Phase 2: accept&lt;/h3&gt;
&lt;p&gt;After &lt;code&gt;prepare&lt;/code&gt;, the proposer computes &lt;code&gt;newVal = update(cachedVal)&lt;/code&gt; and sends &lt;code&gt;Accept(b, newVal, nextBallot)&lt;/code&gt; to acceptors.&lt;/p&gt;
&lt;p&gt;An acceptor rejects if &lt;code&gt;b&lt;/code&gt; is below its &lt;code&gt;promised&lt;/code&gt;. Otherwise it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;stores &lt;code&gt;accepted = b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;stores &lt;code&gt;value = newVal&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;raises &lt;code&gt;promised = max(b, nextBallot)&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The value is committed once a quorum accepts it.&lt;/p&gt;
&lt;p&gt;That optional &lt;code&gt;nextBallot&lt;/code&gt; field is the piggybacked-&lt;code&gt;prepare&lt;/code&gt; optimization. Instead of doing:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;prepare&lt;/code&gt; this round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;accept&lt;/code&gt; this round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prepare&lt;/code&gt; again next time&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;the proposer does:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;prepare&lt;/code&gt; this round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;accept&lt;/code&gt; this round &lt;strong&gt;and&lt;/strong&gt; piggyback the next ballot&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;so the same proposer can cache &lt;code&gt;prepared = nextBallot&lt;/code&gt; and skip the next standalone &lt;code&gt;prepare&lt;/code&gt;.&lt;/p&gt;
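&lt;p&gt;A minimal sketch of the acceptor side of &lt;code&gt;accept&lt;/code&gt; with the piggybacked ballot, under the same assumptions as the rules above. The method name &lt;code&gt;on_accept&lt;/code&gt; and the reply tuples are hypothetical, and persistence is left out.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

Ballot = tuple[int, int]  # (round, proposer id), compared lexicographically


@dataclass
class Acceptor:
    promised: Ballot = (0, 0)
    accepted: Optional[Ballot] = None
    value: Optional[str] = None

    def on_accept(self, b: Ballot, new_val: str, next_ballot: Optional[Ballot] = None):
        # Reject when a higher ballot has already been promised.
        if self.promised > b:
            return ("Reject", self.promised)
        self.accepted = b
        self.value = new_val
        # Piggybacked prepare: promise the following ballot in the same
        # message. This commits nothing by itself.
        self.promised = max(b, next_ballot) if next_ballot is not None else b
        return ("Accepted", b)
```

&lt;p&gt;Once a quorum replies &lt;code&gt;Accepted&lt;/code&gt;, the proposer can treat &lt;code&gt;next_ballot&lt;/code&gt; as already prepared, unless another proposer has since raised &lt;code&gt;promised&lt;/code&gt; above it.&lt;/p&gt;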
&lt;p&gt;&amp;lt;figure class=&quot;protocol-cheatsheet&quot;&amp;gt;
&amp;lt;div class=&quot;protocol-cheatsheet-columns&quot;&amp;gt;
&amp;lt;div class=&quot;protocol-cheatsheet-column&quot;&amp;gt;
&amp;lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/caspaxos-state.png&quot;
data-lightbox
data-lightbox-caption=&quot;CASPaxos cheatsheet: State&quot;
aria-label=&quot;Expand CASPaxos cheatsheet state box&quot;
&amp;gt;
&amp;lt;img
src=&quot;/images/protocol-cheatsheets/caspaxos-state.png&quot;
alt=&quot;CASPaxos cheatsheet box titled State&quot;
loading=&quot;lazy&quot;
/&amp;gt;
&amp;lt;/a&amp;gt;
&amp;lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/caspaxos-rules-for-proposers.png&quot;
data-lightbox
data-lightbox-caption=&quot;CASPaxos cheatsheet: Rules for Proposers&quot;
aria-label=&quot;Expand CASPaxos cheatsheet rules for proposers box&quot;
&amp;gt;
&amp;lt;img
src=&quot;/images/protocol-cheatsheets/caspaxos-rules-for-proposers.png&quot;
alt=&quot;CASPaxos cheatsheet box titled Rules for Proposers&quot;
loading=&quot;lazy&quot;
/&amp;gt;
&amp;lt;/a&amp;gt;
&amp;lt;/div&amp;gt;
&amp;lt;div class=&quot;protocol-cheatsheet-column&quot;&amp;gt;
&amp;lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/caspaxos-ballots.png&quot;
data-lightbox
data-lightbox-caption=&quot;CASPaxos cheatsheet: Ballots&quot;
aria-label=&quot;Expand CASPaxos cheatsheet ballots box&quot;
&amp;gt;
&amp;lt;img
src=&quot;/images/protocol-cheatsheets/caspaxos-ballots.png&quot;
alt=&quot;CASPaxos cheatsheet box titled Ballots&quot;
loading=&quot;lazy&quot;
/&amp;gt;
&amp;lt;/a&amp;gt;
&amp;lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/caspaxos-prepare-rpc.png&quot;
data-lightbox
data-lightbox-caption=&quot;CASPaxos cheatsheet: Prepare RPC&quot;
aria-label=&quot;Expand CASPaxos cheatsheet prepare RPC box&quot;
&amp;gt;
&amp;lt;img
src=&quot;/images/protocol-cheatsheets/caspaxos-prepare-rpc.png&quot;
alt=&quot;CASPaxos cheatsheet box titled Prepare RPC&quot;
loading=&quot;lazy&quot;
/&amp;gt;
&amp;lt;/a&amp;gt;
&amp;lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/caspaxos-accept-rpc.png&quot;
data-lightbox
data-lightbox-caption=&quot;CASPaxos cheatsheet: Accept RPC&quot;
aria-label=&quot;Expand CASPaxos cheatsheet accept RPC box&quot;
&amp;gt;
&amp;lt;img
src=&quot;/images/protocol-cheatsheets/caspaxos-accept-rpc.png&quot;
alt=&quot;CASPaxos cheatsheet box titled Accept RPC&quot;
loading=&quot;lazy&quot;
/&amp;gt;
&amp;lt;/a&amp;gt;
&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&amp;lt;figcaption&amp;gt;
CASPaxos cheatsheet
&amp;lt;/figcaption&amp;gt;
&amp;lt;/figure&amp;gt;&lt;/p&gt;
&lt;p&gt;The important bits are that &lt;code&gt;prepare&lt;/code&gt; is what protects prior work by forcing a proposer to learn the highest value which might still matter, &lt;code&gt;accept&lt;/code&gt; is the only step which actually commits a new value, and piggybacking the next ballot does not commit anything by itself; it only prepares the following round.&lt;/p&gt;
&lt;h2&gt;Fast CASPaxos&lt;/h2&gt;
&lt;p&gt;Everything above used classic proposer-owned ballots. Fast CASPaxos keeps the same proposer and acceptor roles, the same persistent acceptor state, and the same &lt;code&gt;prepare&lt;/code&gt;/&lt;code&gt;accept&lt;/code&gt; structure. The core changes relative to CASPaxos are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ballots with &lt;code&gt;proposerId = 0&lt;/code&gt; are shared fast ballots, so any proposer may use them&lt;/li&gt;
&lt;li&gt;a piggybacked next ballot can therefore prepare either the next proposer-owned classic ballot or the next shared fast ballot&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prepare&lt;/code&gt; still uses a classic majority, but &lt;code&gt;accept&lt;/code&gt; at a fast ballot needs a larger fast quorum&lt;/li&gt;
&lt;li&gt;within a fast ballot, acceptors are first-write-wins: they accept the first value they see at that ballot and reject later different values&lt;/li&gt;
&lt;li&gt;if a later &lt;code&gt;prepare&lt;/code&gt; sees that the highest accepted ballot was fast, it tallies values at that ballot and carries forward the unique maximum, or any tied maximum&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is enough to turn CASPaxos into Fast CASPaxos without changing acceptor state. If two proposers send different values in the same fast ballot, neither reaches the fast quorum, so the protocol falls back to a classic recovery round which applies that tally rule. The cheatsheet and sketch below summarize the core protocol; the later subsections cover practical refinements and optimizations.&lt;/p&gt;
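&lt;p&gt;The first-write-wins rule and the recovery tally can be illustrated with a short sketch. It assumes ballots are &lt;code&gt;(round, proposerId)&lt;/code&gt; tuples, that values are hashable, and that the usual &lt;code&gt;promised&lt;/code&gt; check happens elsewhere. The helper names are hypothetical.&lt;/p&gt;

```python
from collections import Counter
from typing import Optional

Ballot = tuple[int, int]  # (round, proposer id); proposer id 0 marks a fast ballot


def is_fast(b: Ballot) -> bool:
    return b[1] == 0


def fast_accept_allowed(accepted: Optional[Ballot], value, b: Ballot, new_val) -> bool:
    """First-write-wins: within a fast ballot, only the first value sticks."""
    if is_fast(b) and accepted == b:
        return value == new_val  # a repeat is fine, a different value is rejected
    return True


def recover(promises):
    """promises: (accepted ballot or None, value) pairs from a prepare quorum."""
    seen = [(a, v) for (a, v) in promises if a is not None]
    if not seen:
        return None
    highest = max(a for (a, _) in seen)
    values = [v for (a, v) in seen if a == highest]
    if not is_fast(highest):
        return values[0]  # a classic ballot carries a single value
    # Fast ballot: tally the values and carry forward the most common one.
    # Ties resolve to whichever tied value was counted first.
    return Counter(values).most_common(1)[0][0]
```

&lt;p&gt;When the highest accepted ballot is classic, the tally is skipped because that ballot can only carry one value.&lt;/p&gt;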
&lt;p&gt;Here is the corresponding Fast CASPaxos protocol summary from the paper:&lt;/p&gt;
&lt;figure class=&quot;protocol-cheatsheet&quot;&gt;
&lt;div class=&quot;protocol-cheatsheet-columns&quot;&gt;
&lt;div class=&quot;protocol-cheatsheet-column&quot;&gt;
&lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/fast-caspaxos-state.png&quot;
data-lightbox
data-lightbox-caption=&quot;Fast CASPaxos cheatsheet: State&quot;
aria-label=&quot;Expand Fast CASPaxos cheatsheet state box&quot;
&gt;
&lt;img
src=&quot;/images/protocol-cheatsheets/fast-caspaxos-state.png&quot;
alt=&quot;Fast CASPaxos cheatsheet box titled State&quot;
loading=&quot;lazy&quot;
/&gt;
&lt;/a&gt;
&lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/fast-caspaxos-rules-for-proposers.png&quot;
data-lightbox
data-lightbox-caption=&quot;Fast CASPaxos cheatsheet: Rules for Proposers&quot;
aria-label=&quot;Expand Fast CASPaxos cheatsheet rules for proposers box&quot;
&gt;
&lt;img
src=&quot;/images/protocol-cheatsheets/fast-caspaxos-rules-for-proposers.png&quot;
alt=&quot;Fast CASPaxos cheatsheet box titled Rules for Proposers&quot;
loading=&quot;lazy&quot;
/&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;protocol-cheatsheet-column&quot;&gt;
&lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/fast-caspaxos-ballots.png&quot;
data-lightbox
data-lightbox-caption=&quot;Fast CASPaxos cheatsheet: Ballots&quot;
aria-label=&quot;Expand Fast CASPaxos cheatsheet ballots box&quot;
&gt;
&lt;img
src=&quot;/images/protocol-cheatsheets/fast-caspaxos-ballots.png&quot;
alt=&quot;Fast CASPaxos cheatsheet box titled Ballots&quot;
loading=&quot;lazy&quot;
/&gt;
&lt;/a&gt;
&lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/fast-caspaxos-prepare-rpc.png&quot;
data-lightbox
data-lightbox-caption=&quot;Fast CASPaxos cheatsheet: Prepare RPC&quot;
aria-label=&quot;Expand Fast CASPaxos cheatsheet prepare RPC box&quot;
&gt;
&lt;img
src=&quot;/images/protocol-cheatsheets/fast-caspaxos-prepare-rpc.png&quot;
alt=&quot;Fast CASPaxos cheatsheet box titled Prepare RPC&quot;
loading=&quot;lazy&quot;
/&gt;
&lt;/a&gt;
&lt;a
class=&quot;protocol-cheatsheet-card&quot;
href=&quot;/images/protocol-cheatsheets/fast-caspaxos-accept-rpc.png&quot;
data-lightbox
data-lightbox-caption=&quot;Fast CASPaxos cheatsheet: Accept RPC&quot;
aria-label=&quot;Expand Fast CASPaxos cheatsheet accept RPC box&quot;
&gt;
&lt;img
src=&quot;/images/protocol-cheatsheets/fast-caspaxos-accept-rpc.png&quot;
alt=&quot;Fast CASPaxos cheatsheet box titled Accept RPC&quot;
loading=&quot;lazy&quot;
/&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;figcaption&gt;
Fast CASPaxos cheatsheet
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3&gt;Protocol sketch&lt;/h3&gt;
&lt;p&gt;This is simplified to show the core protocol only. It omits duplicate-message handling, cached local views, and the best-effort learn path.&lt;/p&gt;
&lt;p&gt;Notation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ballots are &lt;code&gt;(round, proposerId)&lt;/code&gt;, where &lt;code&gt;proposerId = 0&lt;/code&gt; denotes the shared fast ballot for that round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prepareQuorum = classicQuorum = floor(N / 2) + 1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fastQuorum = ceil(3N / 4)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;acceptQuorumFor(ballot) = fastQuorum&lt;/code&gt; for fast ballots and &lt;code&gt;classicQuorum&lt;/code&gt; for classic ballots&lt;/li&gt;
&lt;li&gt;every &lt;code&gt;prepare&lt;/code&gt; response returns &lt;code&gt;(acceptedBallot, acceptedValue, maxBallot)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prepare(b)&lt;/code&gt; succeeds at an acceptor exactly when the response reports &lt;code&gt;maxBallot = b&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
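&lt;p&gt;As a quick sanity check on the notation, here is a minimal Python sketch of the quorum sizes. The function names are illustrative only (they are not taken from the paper or the repository); only the formulas come from the list above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper names; the formulas follow the notation list.
def classic_quorum(n):
    # floor(N / 2) + 1
    return n // 2 + 1

def fast_quorum(n):
    # ceil(3N / 4), computed with integer arithmetic
    return (3 * n + 3) // 4

def accept_quorum_for(ballot, n):
    # A ballot is (round, proposer_id); proposer_id 0 marks the shared fast ballot.
    _round, proposer_id = ballot
    return fast_quorum(n) if proposer_id == 0 else classic_quorum(n)

# Print N, classic quorum, fast quorum for a few cluster sizes.
for n in (3, 4, 5, 7):
    print(n, classic_quorum(n), fast_quorum(n))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For &lt;code&gt;N = 5&lt;/code&gt; this gives a classic quorum of 3 and a fast quorum of 4. For &lt;code&gt;N = 3&lt;/code&gt; the fast quorum is all three acceptors.&lt;/p&gt;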
&lt;h4&gt;Proposer&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;function propose(update):
    b := initial ballot to try

    loop:
        if b is not already prepared:
            responses := collect Prepare(b) responses
            if responses with maxBallot == b do not reach prepareQuorum:
                b := next classic ballot above the highest maxBallot returned
                continue

            currentValue := recoverValue(responses)
        else:
            currentValue := previously learned value for b

        newValue := update(currentValue)
        next := optional next ballot to piggyback

        responses := collect Accept(b, newValue, next) responses
        if accepts in responses reach acceptQuorumFor(b):
            return newValue

        b := next classic ballot above the highest conflicting ballot returned

function recoverValue(responses):
    highest := maximum acceptedBallot reported in the responses

    if highest == ⊥:
        return ⊥

    if highest is a classic ballot:
        return the value paired with highest

    // highest is a fast ballot, so several values may appear at that ballot
    votes := count values among responses whose acceptedBallot == highest
    return any value tied for the largest count
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;prepare&lt;/code&gt; always needs only a classic majority. The larger quorum is only for fast &lt;code&gt;accept&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A proposer only reuses &lt;code&gt;next&lt;/code&gt; if enough successful &lt;code&gt;accept&lt;/code&gt; responses confirm that the promise really landed.&lt;/li&gt;
&lt;li&gt;In practice, a proposer only skips &lt;code&gt;prepare&lt;/code&gt; when it already has a prepared ballot and enough local knowledge to compute &lt;code&gt;newValue&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
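&lt;p&gt;The &lt;code&gt;recoverValue&lt;/code&gt; rule from the proposer sketch can be written as a small runnable Python function. This is only a sketch under assumptions: responses are modeled as plain &lt;code&gt;(acceptedBallot, acceptedValue, maxBallot)&lt;/code&gt; tuples, &lt;code&gt;None&lt;/code&gt; stands in for ⊥, values are assumed to be hashable, and ties are broken by first appearance (the protocol allows any tied value).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from collections import Counter

def is_fast(ballot):
    # Ballots are (round, proposer_id); proposer_id 0 is the shared fast ballot.
    return ballot[1] == 0

def recover_value(responses):
    # Each response is (accepted_ballot, accepted_value, max_ballot).
    # None stands in for an empty ballot or value.
    accepted = [(b, v) for (b, v, _max) in responses if b is not None]
    if not accepted:
        return None

    # Tuples compare lexicographically, so max picks the highest ballot.
    highest = max(b for (b, _v) in accepted)
    values = [v for (b, v) in accepted if b == highest]

    if not is_fast(highest):
        # A classic ballot carries a single value.
        return values[0]

    # Fast ballot: tally the values and keep one with the largest count.
    # Ties go to the value seen first, which is an arbitrary choice.
    return Counter(values).most_common(1)[0][0]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, three responses at fast ballot &lt;code&gt;(1, 0)&lt;/code&gt; carrying the values 10, 20, and 10 recover 10, while a response at a higher classic ballot overrides any fast-ballot tally.&lt;/p&gt;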
&lt;h4&gt;Acceptor&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;state:
    promisedBallot := (1, 0)       // initial fast ballot is implicitly prepared
    acceptedBallot := ⊥
    acceptedValue := ⊥

function onPrepare(b):
    if b &amp;gt;= promisedBallot:
        promisedBallot := b
    reply prepareResponse(acceptedBallot, acceptedValue, promisedBallot)

function onAccept(b, v, next):
    maxBallot := max(promisedBallot, acceptedBallot)
    if b &amp;lt; maxBallot:
        reply reject(maxBallot)
        return

    // Optimization: an identical accept, possibly from another proposer, is acknowledged again
    if b == acceptedBallot and v == acceptedValue:
        reply accept(promisedBallot)
        return

    if b == acceptedBallot and v != acceptedValue:
        reply reject(maxBallot)
        return

    acceptedBallot := b
    acceptedValue := v

    if next != none:
        promisedBallot := max(promisedBallot, next)

    reply accept(promisedBallot)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Concurrently committing identical values&lt;/h3&gt;
&lt;p&gt;First-write-wins only rules out &lt;em&gt;different&lt;/em&gt; values at the same fast ballot. To avoid treating identical concurrent proposals as conflicts, acceptors still acknowledge an &lt;code&gt;accept&lt;/code&gt; request that exactly matches their current accepted ballot and value. That keeps almost-everywhere-agreement workloads efficient, and the repeated request does not change acceptor state.&lt;/p&gt;
&lt;p&gt;In classic rounds, the same behavior also makes &lt;code&gt;accept&lt;/code&gt; retries idempotent.&lt;/p&gt;
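&lt;p&gt;To make the three outcomes concrete, here is a minimal Python sketch of the accept handler, following the acceptor pseudocode earlier in this section. The class shape, the &lt;code&gt;(ok, ballot)&lt;/code&gt; return tuples, and the use of &lt;code&gt;(0, 0)&lt;/code&gt; as a stand-in for ⊥ are assumptions made for illustration, not the repository code.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BOTTOM = (0, 0)  # assumed stand-in for the empty ballot; sorts below (1, 0)

class Acceptor:
    def __init__(self):
        self.promised = (1, 0)  # initial fast ballot is implicitly prepared
        self.accepted_ballot = BOTTOM
        self.accepted_value = None

    def on_accept(self, b, v, next_ballot=None):
        # Returns (ok, ballot): the promised ballot on success,
        # or the highest known ballot on rejection.
        highest = max(self.promised, self.accepted_ballot)
        if max(b, highest) != b:
            return (False, highest)  # b is below a ballot already seen
        if b == self.accepted_ballot:
            if v == self.accepted_value:
                # Identical repeat: acknowledge without changing state.
                return (True, self.promised)
            # Different value at the same ballot: first write wins.
            return (False, highest)
        self.accepted_ballot = b
        self.accepted_value = v
        if next_ballot is not None:
            self.promised = max(self.promised, next_ballot)
        return (True, self.promised)

acceptor = Acceptor()
first = acceptor.on_accept((1, 0), 7)   # accepted
repeat = acceptor.on_accept((1, 0), 7)  # same ballot and value: acknowledged again
other = acceptor.on_accept((1, 0), 9)   # same ballot, different value: rejected
print(first, repeat, other)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first two calls succeed and the third is rejected, so two proposers racing with the same value at &lt;code&gt;(1, 0)&lt;/code&gt; both count this acceptor toward their quorum.&lt;/p&gt;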
&lt;h3&gt;Register initialization&lt;/h3&gt;
&lt;p&gt;To let proposers initialize a register in 1 RTT, they start directly with the &lt;code&gt;accept&lt;/code&gt; phase for the initial fast round, &lt;code&gt;(1, 0)&lt;/code&gt;. If there is no contention, their initial value is committed; otherwise they learn of the conflicting ballot and retry from the &lt;code&gt;prepare&lt;/code&gt; phase. This works because acceptors are initialized with a promised ballot of &lt;code&gt;(1, 0)&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Reusing a piggybacked ballot safely&lt;/h3&gt;
&lt;p&gt;Piggybacking a next ballot is only a latency optimization. A proposer may skip a standalone &lt;code&gt;prepare&lt;/code&gt; on its next operation only if a quorum of successful &lt;code&gt;accept&lt;/code&gt; responses confirms that the requested &lt;code&gt;next&lt;/code&gt; ballot was actually promised. If only some acceptors echoed it, the proposer discards that candidate ballot and falls back to &lt;code&gt;prepare&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Learning decided values&lt;/h3&gt;
&lt;p&gt;A proposer may send &lt;code&gt;Accept(b, v, next)&lt;/code&gt; only if it already knows the register value for &lt;code&gt;b&lt;/code&gt;. It can learn that value by running &lt;code&gt;prepare&lt;/code&gt;, by committing and reusing the piggybacked next ballot, or via a best-effort learn notification from another proposer.&lt;/p&gt;
&lt;p&gt;Learn traffic is only an optimization. If that notification is stale or missing, the proposer falls back to &lt;code&gt;prepare&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Do we even need classic rounds?&lt;/h3&gt;
&lt;p&gt;Fast CASPaxos could be generalized further to only use fast rounds, and never fall back to classic rounds. &lt;a href=&quot;https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf&quot;&gt;Distributed consensus revised&lt;/a&gt; contains the necessary relaxations/generalizations. On the other hand, having distinct leadered vs leaderless rounds confers higher fault tolerance and it&apos;s arguably less of a leap for people familiar with Classic Paxos already.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Fast CASPaxos is a small extension to CASPaxos that implements a leaderless linearizable register. It&apos;s conceptually a blend of Fast Paxos and CASPaxos. It&apos;s likely most useful for consistent group membership (e.g., Rapid) and metadata replication (e.g., &lt;a href=&quot;https://engineering.fb.com/2019/06/06/data-center-engineering/delos/&quot;&gt;Delos&lt;/a&gt; and other &lt;a href=&quot;https://medium.com/@adamprout/categorizing-how-distributed-databases-utilize-consensus-algorithms-492c8ff9e916&quot;&gt;Consensus for Metadata&lt;/a&gt; systems, of which there are a lot, btw). I&apos;m happy to have scratched this itch that has been bugging me for a long time.&lt;/p&gt;
&lt;p&gt;I uploaded a draft PDF of a &lt;a href=&quot;https://github.com/ReubenBond/fast-caspaxos/blob/main/paper/fast-caspaxos.pdf&quot;&gt;paper&lt;/a&gt; on Fast CASPaxos. The accompanying &lt;a href=&quot;https://github.com/ReubenBond/fast-caspaxos&quot;&gt;repository&lt;/a&gt; also includes a &lt;a href=&quot;https://github.com/ReubenBond/fast-caspaxos/tree/main/tla&quot;&gt;TLA+ model checked with TLC&lt;/a&gt;, a deterministic simulation suite with Porcupine linearizability checking, and some toy &lt;a href=&quot;https://github.com/ReubenBond/fast-caspaxos#benchmarking&quot;&gt;benchmark workloads&lt;/a&gt; for various scenarios and topologies. When reading the code, note that while I spent time on the core (proposer, acceptor, types, etc.), a lot of what surrounds it was generated by coding agents, especially the benchmark harness.&lt;/p&gt;
</content:encoded><author>Reuben Bond</author></item></channel></rss>