dotnet/runtime

Memory<T> and large memory mapped files

hexawyz opened this issue · 27 comments

I'm currently experimenting with OwnedMemory<T> and Memory<T> in an existing project that I'm trying to improve, and I ran into an issue with OwnedMemory<T> and Memory<T> being limited to int.MaxValue.

Scenario

I have a relatively big (> 2GB) data file that I want to fully map in memory (i.e. a database). My API exposes methods that return subsets of this big memory-mapped file, e.g.

public ReadOnlyMemory<byte> GetBytes(int something)
{
    // …
    return mainMemory.Slice(start, length).AsReadOnly();
}

Wrapping the MemoryMappedFile and associated MemoryMappedViewAccessor into an OwnedMemory<byte> seemed to be a good idea, since most of the tricky logic would then be handled by the framework.

Problem

The memory block that I want to wrap is bigger than 2GB and cannot currently be represented by a single Memory instance.
Since Memory<T> can only work with T[], string, or OwnedMemory<T>, it seems that giving up on the straightforward OwnedMemory<T> implementation also means giving up on Memory<T> altogether.

(In this specific case, Span<T> being limited to 2GB, would not be a problem, because the sliced memory blocks that my API would return would always be much smaller than that.)

Possible solutions with the currently proposed API

  • Not using Memory<T> at all and implementing a much simplified version of OwnedMemory<T>/Memory<T> that would fit my use case
  • Keeping many overlapping instances of OwnedMemory<T> around and use the one that best fits the current case

Question

Would it be possible to improve the framework so that it is easy to work with such large memory blocks? (Maybe by implementing something like a BigMemory<T>?)

We will soon be adding ReadOnlyBuffer. See https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.Primitives/System/Buffers/ReadOnlyBuffer.cs

We would be interested in your feedback on this type. Would it support your scenarios?

I took some time to look into this new type and I think I could make it work (haven't had the time to try it yet, though).
I quite like the idea of having a standardized buffer type, but I am a bit afraid about the induced complexity in a case where all memory is contiguous by design. (Especially in the case of the Seek operation)

In my current case, the file is approximately 3.5 GB, so I could create 4 OwnedMemory<byte> instances of 1 GB or less, each backed by its owner, and I would have to chain those blocks by implementing IMemoryList<byte> on them.
If I'm not mistaken, using ReadOnlyBuffer<byte> would mean that creating a Span<byte> for a small part of the buffer, instead of being an O(1) operation such as new Span<byte>(pointer + offset, length), would be a non-trivial O(log N) operation.

As soon as I have the time, I'll try creating a small benchmark for this use case, and compare possible implementations.

@pakrym, @davidfowl I think we could solve the O(log N) seek problem if IMemoryList<T> extended ISequence<T>. ISequence<T> has Seek, and it could be implemented as O(1) on some specialized data structures, e.g. an array of buffers of the same size.

N is the number of segments. So I don't see how this has a big impact if buffers are large.

As I said before I don't like two sources of Positions (ROB and IML)

If IMemoryList extends ISequence, there would not be two sources of position. There would only be APIs on ISequence (Start, TryGet, Seek)

What about ReadOnlyBuffer? It edits Index to put bits into it; how would it know that IMemoryList does not rely on that bit? It's the same conversion as in the previous IMemoryList redesign.

I created a benchmark comparing approaches for accessing a large memory block:
https://github.com/GoldenCrystal/MemoryLookupBenchmark

I tried to get it as close as possible to my real use-case:

  • Find the index and length of the data (I cheated a bit by using constant-length items there)
  • Create a reference to that data for later use (e.g. Span<T>)
  • Copy the item to a buffer (e.g. for Sockets)

Assuming I didn't make any mistakes in the benchmark code, the numbers tell me that using ReadOnlyBuffer would be ~1.95 times slower than implementing a custom slice type:

BenchmarkDotNet=v0.10.12, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.192)
Intel Core i7-4578U CPU 3.00GHz (Haswell), 1 CPU, 4 logical cores and 2 physical cores
Frequency=2929690 Hz, Resolution=341.3330 ns, Timer=TSC
.NET Core SDK=2.1.4
  [Host]     : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
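                                                                                    Method |       Mean |     Error |    StdDev | Scaled | ScaledSD |
------------------------------------------------------------------------------------------ |-----------:|----------:|----------:|-------:|---------:|
                         'Copy a random item to the stack using a locally generated Span.' | 160.455 ns | 1.7740 ns | 1.6594 ns |   1.00 |     0.00 |
 'Copy a random item to the stack using the custom implemented SafeBufferSlice<T> struct.' | 168.540 ns | 3.3838 ns | 4.5172 ns |   1.05 |     0.03 |
                     'Copy a random item to the stack using the ReadOnlyBuffer<T> struct.' | 329.546 ns | 3.3078 ns | 3.0941 ns |   2.05 |     0.03 |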

I'm not sure how much implementing ISequence<T> would improve the performance there. I tend to think it would be difficult to match the performance reached by the more direct uses of (ReadOnly)Span<T>. 🤔

FYI: We are adding IMemoryList.GetPosition(long). It will enable O(1) random access on some IMemoryList implementations (implementations with uniform size segments).

cc: @pakrym
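
To illustrate why uniform segment sizes allow O(1) random access, here is a minimal sketch of the arithmetic involved. It is not the IMemoryList.GetPosition implementation; the UniformSegments class and its Locate method are hypothetical names used only for illustration, and the sketch assumes every segment has exactly the same size.

```csharp
using System;

// Hypothetical helper, not part of any framework API.
// Maps an absolute 64-bit position to (segment index, offset inside that segment),
// assuming all segments have exactly segmentSize elements.
public static class UniformSegments
{
    public static (int SegmentIndex, int OffsetInSegment) Locate(long position, int segmentSize)
    {
        if (position < 0) throw new ArgumentOutOfRangeException(nameof(position));
        if (segmentSize <= 0) throw new ArgumentOutOfRangeException(nameof(segmentSize));

        // One division and one remainder: no walk over the segment list is needed.
        long index = Math.DivRem(position, segmentSize, out long offset);

        // The remainder is always smaller than segmentSize, so it fits in an int.
        return (checked((int)index), (int)offset);
    }
}
```

With 1 GB segments, for example, a position of 3,500,000,000 falls in the fourth segment (index 3). When segment sizes differ, this shortcut no longer applies and a search over running indices is needed instead.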

Using PR dotnet/corefx#27499

                                                  Method |       Mean |        Op/s | Scaled |
-------------------------------------------------------- |-----------:|------------:|-------:|
                                   'MM item. Local Span' | 148.554 ns | 6,731,567.6 |   1.00 |
                               'MM item. BufferSlice<T>' | 154.868 ns | 6,457,113.1 |   1.04 |
                'MM item. ReadOnlySequence<T> (current)' | 272.563 ns | 3,668,870.8 |   1.84 |
 'MM item. ReadOnlySequence<T> (PR dotnet/corefx#27455)' | 254.244 ns | 3,933,232.7 |   1.71 |
 'MM item. ReadOnlySequence<T> (PR dotnet/corefx#27499)' | 211.564 ns | 4,726,706.1 |   1.43 |

Improved to 1.43x the local span time. The benchmark code changes used for this test are in hexawyz/MemoryLookupBenchmark#1

Bear in mind that SafeBufferSlice works directly off a pointer to create its Span so it wouldn't be able to be contained in the ReadOnlySequence data structure or return a ReadOnlyMemory as it doesn't use OwnedMemory, isn't an array or string.

Also, ReadOnlySequence does bounds checking on Slice, which SafeBufferSlice doesn't do; it just adds the offset to the pointer and returns a Span of the requested length, so it's pretty unsafe.

*edit updated with tweaks

Update to benchmarks: PR dotnet/corefx#27499 doesn't scale badly for 100-1000 segments, as shown below

                          Method |    Categories |        Mean |         Op/s | Scaled |
-------------------------------- |-------------- |------------:|-------------:|-------:|
 'ReadOnlySequence<T> (current)' |     1 segment |   103.83 ns |  9,630,807.9 |   1.00 |
       (PR dotnet/corefx#27455)' |     1 segment |    85.50 ns | 11,696,574.0 |   0.82 |
       (PR dotnet/corefx#27499)' |     1 segment |    74.30 ns | 13,458,594.1 |   0.72 |
                                 |               |             |              |        |
 'ReadOnlySequence<T> (current)' |  100 segments | 1,293.73 ns |    772,961.6 |   1.00 |
       (PR dotnet/corefx#27455)' |  100 segments |   969.20 ns |  1,031,774.4 |   0.75 |
       (PR dotnet/corefx#27499)' |  100 segments |   248.77 ns |  4,019,825.1 |   0.19 |
                                 |               |             |              |        |
 'ReadOnlySequence<T> (current)' | 1000 segments | 1,375.86 ns |    726,820.1 |   1.00 |
       (PR dotnet/corefx#27455)' | 1000 segments | 1,026.54 ns |    974,149.4 |   0.75 |
       (PR dotnet/corefx#27499)' | 1000 segments |   286.20 ns |  3,494,079.8 |   0.21 |
                                 |               |             |              |        |
                         Span<T> |       MM item |   147.97 ns |  6,758,249.9 |   0.54 |
                  BufferSlice<T> |       MM item |   152.01 ns |  6,578,374.7 |   0.56 |
 'ReadOnlySequence<T> (current)' |       MM item |   273.28 ns |  3,659,196.5 |   1.00 |
       (PR dotnet/corefx#27455)' |       MM item |   252.47 ns |  3,960,792.4 |   0.92 |
       (PR dotnet/corefx#27499)' |       MM item |   211.79 ns |  4,721,555.1 |   0.78 |

Also ReadOnlySequence does bounds checking on Slice which the SafeBufferSlice doesn't do, it just adds the offset to the pointer and returns a Span of length - so its pretty unsafe.

You're right about that… I just tried adding bounds checking before the creation of BufferSlice<T> to have a more fair comparison, and at least on my machine, it seems to actually increase the throughput 🤨

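                              Method |     Mean |    Error |   StdDev |        Op/s | Scaled | Allocated |
------------------------------------ |---------:|---------:|---------:|------------:|-------:|----------:|
                             Span<T> | 161.9 ns | 1.951 ns | 2.921 ns | 6,178,403.9 |   0.52 |       0 B |
                      BufferSlice<T> | 151.8 ns | 2.123 ns | 3.178 ns | 6,589,287.8 |   0.49 |       0 B |
 'BufferSlice<T> no Bounds Checking' | 166.2 ns | 1.419 ns | 2.124 ns | 6,015,079.6 |   0.54 |       0 B |
     'ReadOnlySequence<T> (current)' | 310.0 ns | 1.916 ns | 2.868 ns | 3,226,296.4 |   1.00 |       0 B |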

I may have made a mistake somewhere, or maybe it simply plays well with the JIT inlining, but I don't know what to conclude.

Anyway, good job with the improvements. The new results are great 🙂

Latest in dotnet/corefx#27499 is much closer still

                          Span<T> |       MM item |   145.45 ns |  6,875,297.6 |   0.55 |
                   BufferSlice<T> |       MM item |   147.68 ns |  6,771,233.9 |   0.55 |
   ReadOnlySequence<T> (previous) |       MM item |   266.73 ns |  3,749,147.7 |   1.00 |
    ReadOnlySequence<T> (current) |       MM item |   246.94 ns |  4,049,523.6 |   0.93 |
    ReadOnlySequence<T> (this PR) |       MM item |   198.30 ns |  5,042,838.1 |   0.74 |

Nice! These results are so close that I doubt the differences will matter outside of microbenchmarks, i.e. once the program starts doing something interesting with the data in the buffers.

I am going to close this. If there is data showing that ROS still cannot support real apps with multi-segmented buffers, we can think about how to improve the perf further. @GoldenCrystal thanks for bringing this scenario to our attention.

Copying conversation over from https://github.com/dotnet/coreclr/issues/5851#issuecomment-370276484

From @kstewart83:

What is the possibility of adding a Span/Memory constructor for working with memory mapped files? Currently, it looks like I have to have unsafe code in order to do this:

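using System;
using System.IO.MemoryMappedFiles;

var dbPath = "test.txt";
var initialSize = 1024;
var mmf = MemoryMappedFile.CreateFromFile(dbPath);
var mma = mmf.CreateViewAccessor(0, initialSize).SafeMemoryMappedViewHandle;
Span<byte> bytes;
unsafe
{
    // Each AcquirePointer call must later be balanced by mma.ReleasePointer().
    byte* ptrMemMap = null;
    mma.AcquirePointer(ref ptrMemMap);
    // ByteLength is a ulong; the cast only works for views of at most int.MaxValue bytes.
    bytes = new Span<byte>(ptrMemMap, (int)mma.ByteLength);
}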

Also, it seems like I can only create Spans, as there aren't public constructors for Memory that take a pointer (maybe I'm missing the reason for this). But since the view accessors have safe memory handles that implement System.Runtime.InteropServices.SafeBuffer (i.e., they have a pointer and a length)...it seems natural to be able to leverage this for Span/Memory. So what would be nice is something like this:

var dbPath = "test.txt";
var initialSize = 1024;
var mmf = MemoryMappedFile.CreateFromFile(dbPath);
var mma = mmf.CreateViewAccessor(0, initialSize).SafeMemoryMappedViewHandle;
var mem = new Memory(mma);
var span = mem.Span.Slice(0, 512);

I also noticed that the indexer and internal length of Span uses int. With memory mapped files (especially for database scenarios) it is reasonable that the target file will exceed the upper limit for int. I'm not sure about the performance impact of long based indexing or if there is some magic way to have it both ways, but it would be convenient for certain scenarios.
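
As a point of comparison for the wish above: no Memory<T> constructor taking a SafeBuffer exists, but the unsafe part can be confined to one small class. The following is a minimal sketch, assuming the MemoryManager<T> base class from System.Buffers (the type that took over the role of OwnedMemory<T>). SafeBufferMemoryManager is a hypothetical name, not a framework type, and the class must be compiled with unsafe code enabled.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical wrapper (not a framework type): exposes `length` bytes of a SafeBuffer,
// starting at `offset`, as Memory<byte>. Limited to int.MaxValue bytes like Memory<T> itself.
public sealed unsafe class SafeBufferMemoryManager : MemoryManager<byte>
{
    private readonly SafeBuffer _buffer;
    private readonly byte* _start;
    private readonly int _length;
    private int _released;

    public SafeBufferMemoryManager(SafeBuffer buffer, long offset, int length)
    {
        if (buffer is null) throw new ArgumentNullException(nameof(buffer));
        if (offset < 0 || length < 0 || (ulong)offset + (ulong)length > buffer.ByteLength)
            throw new ArgumentOutOfRangeException(nameof(offset));

        byte* pointer = null;
        buffer.AcquirePointer(ref pointer); // balanced by ReleasePointer in Dispose
        _buffer = buffer;
        _start = pointer + offset;
        _length = length;
    }

    public override Span<byte> GetSpan() => new Span<byte>(_start, _length);

    // Mapped memory is not moved by the GC, so pinning only has to hand out the address.
    public override MemoryHandle Pin(int elementIndex = 0)
    {
        if ((uint)elementIndex > (uint)_length)
            throw new ArgumentOutOfRangeException(nameof(elementIndex));
        return new MemoryHandle(_start + elementIndex);
    }

    public override void Unpin() { }

    protected override void Dispose(bool disposing)
    {
        // Release the pointer exactly once, even if Dispose is called repeatedly.
        if (Interlocked.Exchange(ref _released, 1) == 0)
            _buffer.ReleasePointer();
    }
}
```

A caller would pass accessor.SafeMemoryMappedViewHandle together with accessor.PointerOffset as the offset, then use the manager's Memory property from safe code. The sketch has no finalizer, so the pointer stays acquired until the manager is disposed, and any Memory<byte> or Span<byte> obtained from it must not be used after that point.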


From @kstewart83:

Unfortunately, looking at https://github.com/dotnet/corefx/issues/26603 along with the referenced code in the benchmarks didn't clear things up for me. It seems like that particular use case is geared to copying small bits of the memory mapped files into Spans and ReadOnlySegments. It looks like the solution still involves unsafe code with OwnedMemory<T>, which is what I'd like to avoid. I don't have experience with manual memory management in C#, so some of this is a little difficult to grasp. What I found appealing about Span/Memory is that I could now get additional performance and reduce or eliminate copying data around without the headache of manual memory management and the issues that come with it. Memory mapped files seem to fit into the target paradigm of Span/Memory (unifying the APIs around contiguous random access memory), so hopefully some type of integration of memory mapped files and Span/Memory makes it in at some point.


From @davidfowl:

@KrzysztofCwalina I think we should create something first class with Memory mapped files and the new buffer primitives (ReadOnlySequence).

@kstewart83 all we have right now are extremely low level primitives that you have to string together to make something work. That specific issue was about the performance gap between using Span directly and using the ReadOnlySequence (the gap has been reduced for that specific scenario).

Dealing with anything bigger than an int you'll need to use ReadOnlySequence<T> which is just a view over a linked list of ReadOnlyMemory<T>.
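
To make the "linked list of ReadOnlyMemory<T>" concrete, here is a minimal sketch of chaining several chunks into one ReadOnlySequence<byte>. ChunkSegment and ChunkedSequence are hypothetical names introduced for illustration; the sketch relies only on ReadOnlySequenceSegment<T> and the ReadOnlySequence<T> constructor that takes a start and an end segment.

```csharp
using System;
using System.Buffers;

// Hypothetical segment type. ReadOnlySequenceSegment<T> is abstract and its
// setters are protected, so a small subclass is needed to build the list.
public sealed class ChunkSegment : ReadOnlySequenceSegment<byte>
{
    public ChunkSegment(ReadOnlyMemory<byte> memory) => Memory = memory;

    // Links a new segment after this one and returns it.
    public ChunkSegment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new ChunkSegment(memory)
        {
            // RunningIndex is the absolute (64-bit) offset of the segment's first element.
            RunningIndex = RunningIndex + Memory.Length
        };
        Next = next;
        return next;
    }
}

public static class ChunkedSequence
{
    // Builds one logical sequence out of several chunks of at most int.MaxValue bytes each.
    public static ReadOnlySequence<byte> Create(params ReadOnlyMemory<byte>[] chunks)
    {
        if (chunks.Length == 0) return ReadOnlySequence<byte>.Empty;

        var first = new ChunkSegment(chunks[0]);
        var last = first;
        for (int i = 1; i < chunks.Length; i++)
            last = last.Append(chunks[i]);

        return new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);
    }
}
```

The resulting sequence reports a long Length and can be sliced with Slice(long, long), including across chunk boundaries. In the scenario of this thread, each chunk would be a ReadOnlyMemory<byte> over a part of the mapped file of 1 GB or less.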

GSPP commented

It is not generally possible to slice large files into 1GB span segments. For example, a file could contain a large stream of small serialized items. Then, it's not possible to know where to cut the file. Slicing it could lead to torn items.

So it's no longer possible to create a span and pass it to some API of the form IEnumerable<MyItem> DeserializeStream(Span<byte> span) because the caller cannot know the slicing boundaries.

It would be really good if span supported long length. Some .NET users are already bumping against the 2GB array size limitations. For that reason the limit was increased to 2G items but that's only a short term remedy. As main memory sizes continue to grow any 2GB limit will make .NET look like ancient technology.

But I assume the int span length was consciously chosen... Unfortunately, I did not readily find a discussion about that but I would be interested to read it if somebody has a url to it at hand.

jnm2 commented

Wouldn't it be better for the API to be built to handle chunks and therefore work with streaming scenarios as well?

But I assume the int span length was consciously chosen... Unfortunately, I did not readily find a discussion about that but I would be interested to read it if somebody has a url to it at hand.

If I understand correctly, the problem here would be more with Memory<T> than with Span<T>:

The current version of Memory<T> packs nicely into 16 bytes on x64, while Span<T> seems to have room for replacing the int _length by IntPtr _length and still fitting into 8/16 bytes.
However, widening the Length property of Span<T> requires doing the same with Memory<T>.
If I'm not mistaken, increasing the size of Memory<T> (from 16 bytes to 24 bytes) might have consequences on the performance of the code, which would impact everyone. (Not just those of us that are playing with large regions of memory)

It is true that in the case I presented, ReadOnlySequence<T> acts as a valid replacement for a 64-bit-enabled Memory<T> / Span<T>, because all I needed was to copy the data somewhere.
But when you need to read/decode without copying, the API might indeed be less straightforward. 🤔

I suspect though that since Memory<T> is allocated on the heap, the performance impacts would be different than say for Span<T>. Passing a Memory<T> object around shouldn't be any different, so I think the only performance impact would be in creating Span<T>s or maybe the fill routines?

A compelling use case I see with combining memory mapped files with Memory<T>/Span<T> is specifically to enable zero copy databases with only safe C#. It allows for a very understandable and uniform API by being able to present ReadOnly slices as well as ReadWrite slices. This could be combined with data formats such as FlatBuffers which don't require explicit parsing/unpacking to access the data.

Memory is not allocated on the heap (necessarily). It's a struct.

@KrzysztofCwalina, there is no API proposal for MMF Memory/Span overloads; should this issue be converted to api-needs-work? It would help downstream projects (serializers, other data-processing libraries, etc.) that are waiting to update to .NET Core 2.1 if MMF also joined the Span<T> and Memory<T> club. Thanks!

@kasper3, please open a separate issue for adding span support to MMF. This issue was about Memory's length property not being able to deal with large files.

@kasper3 @KrzysztofCwalina is there a separate issue for MMF/Span? I was not able to find it and is not linked here.

I am not aware.

@attilah, related https://github.com/dotnet/corefx/issues/29562#issuecomment-388182098 and overarching idea https://github.com/dotnet/corefx/issues/30174.
In the case of MemoryMappedFile.CreateFromMemory, the file I/O operation behind every .WriteX(..) call would need to be replaced by a memory I/O operation. The use case I had in mind: a user downloads a data file and, without persisting it to the filesystem, the file is mapped to memory and sent back on the wire. If you have better ideas about how the API should be structured relative to the competing/related proposals, please send a proposal.

Sorry to be late to this, but it is not very clear to me from the above what is currently the recommended way to turn a MemoryMappedFile into a ReadOnlySequence<byte> (or ReadOnlySpan<byte>)?

sakno commented

@miloush, you can use a third-party library. ReadOnlySequenceAccessor is probably what you need.