dotnet/runtime

API Proposal: Add a ValueStringBuilder

JeremyKuhne opened this issue ยท 60 comments

We should consider making a value based StringBuilder to allow low allocation building of strings. While Span allows you to provide a writable buffer, in many scenarios we have a need to get or build strings and we don't know precisely how much space will be needed ahead of time. Having an abstraction that can grow beyond a given initial buffer is particularly useful as it doesn't require looping with Try* APIs- which can be both complicated and have negative performance implications.

We currently use ValueStringBuilder for this purpose internally. It starts with an optional initial buffer (which we often stackalloc) and will grow using ArrayPool if needed.

Design Goals

  1. Allow safe usage of stack memory
  2. Use pooled memory when needed to reduce GC pressure
  3. Allow dynamic and explicit capacity growth
  4. Facilitate interop scenarios (i.e. passing as char* szValue)
  5. Follow API semantics of StringBuilder & string where possible
  6. Be stack allocated

API

Here is the proposed API:

namespace System.Text
{
    public ref struct ValueStringBuilder
    {
        public ValueStringBuilder(Span<char> initialBuffer);

        // The logical length of the builder (end of the "string")
        public int Length { get; set; }

        // Available space in chars
        public int Capacity { get; }

        // Ensure there is at least this amount of space
        public void EnsureCapacity(int capacity);

        // Get a pinnable reference to the builder. "terminate" ensures the builder has a null char after Length in the buffer.
        public ref char GetPinnableReference(bool terminate = false);

        // Indexer, allows setting/getting individual chars
        public ref char this[int index] { get; }

        // Returns a string based off of the current position
        public override string ToString();

        // Returns a span around the contents of the builder. "terminate" ensures the builder has a null char after Length in the buffer.
        public ReadOnlySpan<char> AsSpan(bool terminate);

        // To ensure inlining perf, we have a separate overload for terminate
        public ReadOnlySpan<char> AsSpan();

        public bool TryCopyTo(Span<char> destination, out int charsWritten);

        public void Insert(int index, char value, int count = 1);
        public void Insert(int index, ReadOnlySpan<char> value, int count = 1);

        public void Append(char c, int count = 1);
        public void Append(ReadOnlySpan<char> value);

        // This gives you an appended span that you can write to
        public Span<char> AppendSpan(int length);

        // Returns any ArrayPool buffer that may have been rented
        public void Dispose()
    }
}

This is the current shape of our internal ValueStringBuilder:

namespace System.Text
{
    internal ref struct ValueStringBuilder
    {
        public ValueStringBuilder(Span<char> initialBuffer);
        public int Length { get; set; }
        public int Capacity { get; }
        public void EnsureCapacity(int capacity);

        /// <summary>
        /// Get a pinnable reference to the builder.
        /// </summary>
        /// <param name="terminate">Ensures that the builder has a null char after <see cref="Length"/></param>
        public ref char GetPinnableReference(bool terminate = false);
        public ref char this[int index] { get; }

        // Returns a string based off of the current position
        public override string ToString();

        /// <summary>
        /// Returns a span around the contents of the builder.
        /// </summary>
        /// <param name="terminate">Ensures that the builder has a null char after <see cref="Length"/></param>
        public ReadOnlySpan<char> AsSpan(bool terminate);

        // To ensure inlining perf, we have a separate overload for terminate
        public ReadOnlySpan<char> AsSpan();

        public bool TryCopyTo(Span<char> destination, out int charsWritten);
        public void Insert(int index, char value, int count);
        public void Append(char c);
        public void Append(string s);
        public void Append(char c, int count);
        public unsafe void Append(char* value, int length);
        public void Append(ReadOnlySpan<char> value);

        // This gives you an appended span that you can write to
        public Span<char> AppendSpan(int length);

        // Returns any ArrayPool buffer that may have been rented
        public void Dispose()
    }
}

Sample Code

Here is a common pattern on an API that could theoretically be made public if ValueStringBuilder was public:
(Although we would call this one GetFullUserName or something like that.)

https://github.com/dotnet/corefx/blob/050bc33738887d9d8fcc9bc5965b7d9ca65bc7f4/src/System.Runtime.Extensions/src/System/Environment.Win32.cs#L40-L56

The caller is above this method:

https://github.com/dotnet/corefx/blob/050bc33738887d9d8fcc9bc5965b7d9ca65bc7f4/src/System.Runtime.Extensions/src/System/Environment.Win32.cs#L13-L38

Usage of AppendSpan:

https://github.com/dotnet/corefx/blob/3538128fa1fb2b77a81026934d61cd370a0fd7f5/src/System.Runtime.Numerics/src/System/Numerics/BigNumber.cs#L550-L560

I'll add more usage details and possible API surface area.

Notes

  • Does using this remove the need for having a bool TryGet*(Span) overload? (see https://github.com/dotnet/coreclr/pull/17097/files#r176560435)
  • Should we allow you to make it non-growable? (So we can avoid TryGet above?)
  • Can we get C# to allow using ref structs that have a Dispose() in a using statement? (dotnet/csharplang#93)
  • Do we care about AppendFormat overloads?
  • AppendSpan() is a little tricky to grok- is there a better term/pattern?

bool terminate

I didn't see that we'd added these arguments. I don't think we should have them. If you want to null terminate, you can call Append('\0').

I didn't see that we'd added these arguments. I don't think we should have them. If you want to null terminate, you can call Append('\0').

Discussed this one at length with @jkotas in dotnet/coreclr#16921. The problem with appending a null is that it changes the Length and breaks the convention on what Length means for strings & StringBuilder (i.e. doesn't include the terminator). It is incredibly error prone as a result (as my need for the mentioned pull demonstrates).

StringBuilder always terminates, but doing this comes at a cost. It is only really needed in interop calls that take a sz (most Win32).

The problem with appending a null is that it changes the Length and breaks the convention on what Length means for strings & StringBuilder (i.e. doesn't include the terminator).

The bool terminate is also changing the Length.

The bool terminate is also changing the Length.

No, it leaves Length as is. It might increase Capacity.

Hmm, I see, ok.

Are they needed if there is an overload taking ROSpan?

public void Append(string s);
public unsafe void Append(char* value, int length);

Are they needed if there is an overload taking ROSpan?

No, they aren't. I'll add a proposed section.

Can we get C# to allow using ref structs that have a Dispose() in a using statement?

There is a proposal dotnet/csharplang#93
It would be even better with this one dotnet/csharplang#114

There are several related efforts in corfxlab. We should try to unify (or make them consistent). See:
https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriter.cs
https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriterT_ints.cs
https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriter_strings.cs
And here support for non-allocating composite formats: https://github.com/dotnet/corefxlab/blob/master/src/System.Text.Formatting/System/Text/Formatting/CompositeFormat.cs

This ValueStringBuilder is a specialized BufferWriter. And so if we add it, we need to rationalize it with BufferWriter, i.e. we should not have both BufferWriter and ValueStringBuilder in .NET.

BufferWriter (the non-generic one) is a byref struct that allows writing (and resizing) to a span buffer. Today, it writes Utf8, but you could imagine writing either Utf8 or utf16 based on some setting. After the buffer is written, tere could be ToString or something like that to produce string result.

@KrzysztofCwalina thanks for the links! I've updated the text with design goals and will review the other efforts.

Should we allow you to make it non-growable? (So we can avoid TryGet above?)

// some reason it isn't, we'll grow the buffer.

In case of growable could be useful to have a boolean properties to know if VSB has grown? In a hot path could be useful avoid to add pressure to underlying ArrayPool.Shared(today) and improve performance by choosing the right size of initial buffer.

Some questions, I guess...

public void Insert(int index, char value, int count = 1);
public void Insert(int index, ReadOnlySpan<char> value, int count = 1);
public void Append(char c, int count = 1);
public void Append(ReadOnlySpan<char> value);
  • Are these the only methods to build with, or will there be more parity with the existing StringBuilder API (AppendLine, string parameter, et al.) and those are omitted from the post for brevity?
  • Is there a reason these don't return the this instance to enable fluent building?

Speaking of related efforts... There's also the InplaceStringBuilder from Microsoft.Extensions.Primitives, which is used in MVC and the HTTP abstractions. Might want to sync up with them as well.

// @pakrym @Tratcher

In case of growable could be useful to have a boolean properties to know if VSB has grown?

Perhaps we should add an EventSource log for this?

@Joe4evr

Are these the only methods to build with, or will there be more parity with the existing StringBuilder API (AppendLine, string parameter, et al.) and those are omitted from the post for brevity?

These were the most critical/obvious (based on current usage). I didn't want to start the discussion with too much complexity. Attacking this in priority order seemed like it would go a little quicker.

Is there a reason these don't return the this instance to enable fluent building?

Hadn't considered it is all. I'll take a look.

fluent building

Fluent pattern adds overhead, and it does not work for structs that needs to be disposed.

Fluent pattern adds overhead, and it does not work for structs that needs to be disposed.

Does it, now? Surely, it's less overhead than display class allocation or something.

Combining two of @khellang's comment https://github.com/dotnet/corefx/issues/20378#issuecomment-304400894 & https://github.com/dotnet/corefx/issues/28379#issuecomment-375975366, it would be super helpful for almost all the .NET developers if we get a comprehensive blog article or Channel 9 video on all these newly optimized string related types in .NET Core and ASP.NET Core detailing when to use which and why. During that exercise if some overlapping types are found, please consider deprecating one. With every upcoming string-related type, it is getting scarier. ๐Ÿ˜ฑ

it would be super helpful [...] if we get a comprehensive blog article or Channel 9 video on all these newly optimized string related types [...] detailing when to use which and why.

I think the general guidance (once Span<T> and friends officially hit the streets) is to use Span<T> (or ReadOnlySpan<T>, Memory<T>, ReadOnlyMemory<T> etc.) whenever you can.

StringSegment, ArraySegment<T> and various other related types (some of which has been around for some time) are all efforts to achieve some of the same benefits as the System.Memory types. At this point, I think you could argue they should all be viewed as "obsolete" or "redundant".

No, but with the other new APIs (on ReadOnlySpan<char>), it's arguable how much value yet another slice-of-string type would bring vs the overhead of the type existing at all. So I would say "redundant" ๐Ÿ˜‰

๐Ÿ˜• The API proposal itself clearly states:

but spans have different type system constraints that not all consumers can -- or want to -- opt into.

I'm assuming @terrajobst was referring to the stack-only-ness of ReadOnlySpan<T>. That doesn't apply to ReadOnlyMemory<T>.

There might be valid reasons for yet another slice-of-string type, but as I mentioned in https://github.com/dotnet/corefx/issues/20378#issuecomment-304400894, the number of these types are getting a bit out of hand and it's hard to know which to use and which to support in libraries.

The key thing we're missing that this addresses is a low allocation expandable Span<T> provider. We could go super simple with something like ref SpanProvider<T> cutting all string-like methods and perhaps just providing the functionality as extensions. I'm not fundamentally opposed to going that route, but the fact is that we do provide StringBuilder. It is a concept that people are familiar with and we need a mechanism to replace it for performance sensitive code paths. It seems less burdensome to me to juggle StringBuilder/ValueStringBuilder than StringBuilder/SpanProvider<char>, particularly as we're already moving down the road of having Value* value type equivalents of other reference types (e.g. it is an emergent pattern).

fanoI commented

I have a question regarding the Dispose() of this structure it should be always called?
If so the posted examples are not risking to leaking memory?

I have a question regarding the Dispose() of this structure it should be always called?

It returns any potentially rented buffers to the array pool. ToString() internally also returns the buffer, I'll call that out. On a related note, we don't usually use these with try ... finally blocks as in many cases we don't expect throws to be common enough to warrant taking the performance hit of introducing the block- we're ok with not returning the buffer to the pool in those cases (which will get it garbage collected).

Note that I'm exploring alternative options utilizing some of the concepts @KrzysztofCwalina brought up (with @KrzysztofCwalina). We'll circle back around with more info.

@JeremyKuhne have you made progress on exploring new options? I think we should definitely get this into 3.0.

have you made progress on exploring new options?

dotnet/corefxlab#2177 is part of what might negate the need for this. I currently own productizing BufferReader/Writer (dotnet/corefxlab#2166) and have just started in earnest.

I'll be experimenting with how I can provide the same functionality and taking a look at any tradeoffs. ValueStringBuilder is the backup plan if I can't provide a reasonable solution with the other new types.

I think we should definitely get this into 3.0.

I intend to get this or an equivalent in. I really want it as well as not having this sort of type is blocking adding other APIs that I'm very interested in seeing. :)

I'd expect that the API has a parity with StringBuilder:

  • AppendFormat
  • AppendJoin
  • AppendLine
  • Clear
  • Remove
  • Replace

I'd expect that the API has a parity with StringBuilder

I don't believe that should be a goal. They're meant for different uses and therefore don't need perfect symmetry. For example, ValueStringBuilder is meant for situations where you're trying to minimize allocation overhead as much as possible, and in such situations, you're not going to want to use a method that takes a params object[].

I think more about what users will replace StringBuilder with ValueStringBuilder in existing codes and can have difficulty due to the lack of some methods.

Not params methods, but Clear() is reasonable, and also Remove and Replace given it already has O(n) methods like Insert.

AppendFormat() does a lot of work (and equally requires a lot of code to implement). This type is more narrowly targeted at scenarios where you want to accept a bit more complexity for the most possible performance, in that scenario you are likely to handle any formatting yourself.

@iSazonov Having format methods on this in the future is definitely something I'm looking at. I'll be putting a prototype of one in corefxlab very soon. Doing formatting in a non-allocating way is a little tricky given boxing, etc. As such it will take a little while to shake out and I don't want to hold up the core functionality here. I'll make sure to tag you on the prototype PR.

StringBuilder is optimized for convenience and broad range of scenarios. It will continue to work better than ValueStringBuilder in many case.

ValueStringBuilder is not optimized for convenience. It is optimized for building small strings (<1kB) where the work required to format the string is small and fixed StringBuilder overhead is too high.

where the work required to format the string is small

Could you please clarify that do you mean? I'd expect that formatting is intended primarily for people and result string will definitely be smaller 1 Kb like in example I pointed above.

Formatting strings and string interpolation as implemented in .NET today are convenience that you pay quite a bit for.

String formatting/interpolation that is both convenient to use and efficient as handwritten formatting would require C# compiler changes, I think.

Video

We believe that without a feature that forces consumers to pass value types by reference (either language feature or in-box analyzer) this will cause unreliable applications due to corruption of the array pool when accidental copies are being made.

@JeremyKuhne do you want to drive that discussion with @jaredpar?

@JeremyKuhne do you want to drive that discussion with @jaredpar?

Yes, we need to have a way to force structs to be only passed by reference or (possibly) stack only classes. One possibility for forced by-ref structs that has been raised is ref local.

One possibility for forced by-ref structs that has been raised is ref local.

I'm sad it's not ref ref ๐Ÿ˜‰

Since the language team has expressed interest in allowing more features that might need new runtime constructs, it seems best to consider "Non-Copyable structs" (or maybe "ref Fields" instead) on a list of features that may warrant future-CLR enforcement.

A few months back I implemented basically this, thinking it was a good idea as well. Exact implementation will cause performance to vary, so take it with a grain of salt, but I was found was roughly:

For allocations of 5 chars long, less than 12 allocations was faster to use the traditional StringBuilder (tSB). For allocations less than 5 chars long, it was almost always faster to use the tSB. For allocations more than 10 chars long, it was almost always faster to use the tSB. For allocations of 5 chars long, only between 12 to 41 allocations were faster with the value/ref StringBuilder (vSB). For allocations of 8 chars long, only between 18 to 22 allocations were faster for the vSB.

Hopefully you're getting the picture without me needing to clone the old commit and recreate the 3D graphs. The amount of situations where the vSB was actually higher performing were extremely limited, and not clearly or obviously defined.

Furthmore, in every single code base I have where this might have been useful, I worked out ways to just use an array/buffer, where even additional computations or value passing still outperformed the vSB in every single instance.

So, it might be useful, but as a heads up to whoever implements this, you're gonna need to benchmark the ever living **** out of it, and document the small hole where it's actually useful.

@Entomy by faster, presumably you mean allocation. There is also the garbage to consider.

@danmosemsft Right, absolutely. That's basically the point I'm trying to make, is that this is complicated and it seems like there's only a very small, and unusual, range in which this is viable.

there's only a very small, and unusual

Reduction in allocation often has little-to-no measurable impact on serialized benchmark throughput. The benefit of allocation reduction when it comes to throughput is generally about the whole app running at scale.

This is marked blocked because it potentially requires language features. We need a way to enforce that structs are passed only by reference so that we don't double return arrays to the pool.

Anything slated for language changes to allow this? I think forcing ref structs makes sense as a new language construct.

Will it come out in 6.0.0?

For now I exposed my own by essentially shamelessly copying it:
https://github.com/modernuo/ModernUO/blob/main/Projects/Server/Buffers/ValueStringBuilder.cs

Also you can use https://github.com/Cysharp/ZString which has a great StringBuilder-like API.

This is essentially blocked on resolving discussion in #50389.

Since roslyn adds support for utf8 strings handled as ReadOnlySpan<byte>, i think it will be better to have something like ValueSpanBuilder<T> instead.

Has been anything further done in this regard, i ask out of curiosity

lsoft commented

ValueStringBuilder is in runtime now, as I can see: https://github.com/dotnet/runtime/blob/main/src/libraries/Common/src/System/Text/ValueStringBuilder.cs

but not public for some reason. what is that reason? public ValueStringBuilder is the key to provide, for example, global::System.Net.WebUtility.HtmlDecode(ReadOnlySpan<char>) into the public space.

but not public for some reason. what is that reason?

#25587 (comment)

Just saw another copy of this type pop up, and see that we currently have 20+ copies in the product. https://source.dot.net/#q=ValueStringBuilder

Is there any progress on the language feature that would make us comfortable exposing this type?

and see that we currently have 20+ copies in the product

To clarify (I'm not sure if this is what you meant or not), we have that many in binaries, not that many source copies, with the same source built as internal into many places.