High performance, thread-safe in-memory caching primitives for .NET.
Install-Package BitFaster.Caching
Class | Description |
---|---|
ConcurrentLru | Represents a thread-safe bounded size pseudo LRU. A drop-in replacement for ConcurrentDictionary, but with bounded size. Maintains pseudo order, with a better hit rate than a pure LRU and without being prone to lock contention. |
ConcurrentTLru | Represents a thread-safe bounded size pseudo TLRU, where items have a TTL. As ConcurrentLru, but with a time aware least recently used (TLRU) eviction policy. If the values generated for each key can change over time, ConcurrentTLru is eventually consistent, where the inconsistency window = TTL. |
SingletonCache | Represents a thread-safe cache of key value pairs, which guarantees a single instance of each value. Values are discarded immediately when no longer in use to conserve memory. |
Scoped | Represents a thread-safe wrapper for storing IDisposable objects in a cache that may dispose and invalidate them. The scope keeps the object alive until all callers have finished. |
ConcurrentLru and ConcurrentTLru are intended as a drop-in replacement for ConcurrentDictionary, and a much faster alternative to the System.Runtime.Caching.MemoryCache family of classes (e.g. HttpRuntime.Cache, System.Web.Caching etc.).
Choose a capacity and use just like ConcurrentDictionary:
int capacity = 666;
var lru = new ConcurrentLru<int, SomeItem>(capacity);
var value = lru.GetOrAdd(1, (k) => new SomeItem(k));
var asyncValue = await lru.GetOrAddAsync(0, (k) => Task.FromResult(new SomeItem(k)));
All cache classes in BitFaster.Caching own the lifetime of cached values, and will automatically dispose values when they are evicted.
To avoid races using objects after they have been disposed by the cache, wrap them with Scoped. The call to CreateLifetime creates a Lifetime that guarantees the scoped object will not be disposed until the lifetime is disposed. Scoped is thread safe, and guarantees correct disposal for concurrent lifetimes.
int capacity = 666;
var lru = new ConcurrentLru<int, Scoped<SomeDisposable>>(capacity);
var valueFactory = new SomeDisposableValueFactory();
using (var lifetime = lru.GetOrAdd(1, valueFactory.Create).CreateLifetime())
{
    // lifetime.Value is guaranteed to be alive until the lifetime is disposed
}
class SomeDisposableValueFactory
{
    public Scoped<SomeDisposable> Create(int key)
    {
        return new Scoped<SomeDisposable>(new SomeDisposable(key));
    }
}
SingletonCache enables mapping every key to a single instance of a value, and keeping the value alive only while it is in use. This is useful when the total number of keys is large, but few will be in use at any moment.
The example below shows how to implement exclusive Url access using a lock object per Url.
var urlLocks = new SingletonCache<Url, object>();
Url url = new Url("https://foo.com");
using (var lifetime = urlLocks.Acquire(url))
{
    lock (lifetime.Value)
    {
        // exclusive url access
    }
}
MemoryCache is perfectly serviceable, but it has some limitations:
- Makes heap allocations when the native object key is not of type string.
- Is not 'scan' resistant: fetching all keys will load everything into memory.
- Does not scale well with concurrent writes.
- Contains perf counters that can't be disabled.
- Uses a heuristic to estimate memory used, and evicts items using a timer. The 'trim' process may remove useful items, and if the timer does not fire fast enough the resulting memory pressure can be problematic (e.g. induced GC).
The cache replacement policy must maximize the cache hit rate, and minimize the computational and space overhead involved in implementing the policy. An analysis of hit rate vs cache size, latency and throughput is provided below.
The charts below show the relative hit rate of classic LRU vs Concurrent LRU on a Zipfian distribution of input keys, with parameter s = 0.5 and s = 0.86 respectively. If there are N items, the probability of accessing an item numbered i or less is (i / N)^s.
Here N = 50000, and we take 1 million sample keys. The hit rate is the number of times we get a cache hit divided by 1 million. This test was repeated with the cache configured to different sizes expressed as a percentage N (e.g. 10% would be a cache with a capacity 5000).
As above, but interleaving a sequential scan of every key (aka sequential flooding). In this case, ConcurrentLru performs better across the board, and is more resistant to scanning.
These charts summarize the percentage increase in hit rate of ConcurrentLru vs classic LRU. The increase in hit rate is significant at lower cache sizes.
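For reference, Zipfian keys with this CDF can be generated with a simple inverse-transform sampler. The sketch below is hypothetical (it is not the code used to produce these charts); the key set size, sample count and parameter s follow the description above.

```csharp
using System;
using BitFaster.Caching.Lru;

// Hypothetical sketch: inverse-transform sampling for the Zipfian distribution
// described above. Since P(rank <= i) = (i / N)^s, a uniform u in (0, 1]
// maps to rank = ceiling(N * u^(1/s)).
public static class ZipfSampler
{
    public static int NextKey(Random random, int n, double s)
    {
        double u = 1.0 - random.NextDouble(); // uniform in (0, 1]
        return (int)Math.Ceiling(n * Math.Pow(u, 1.0 / s));
    }

    public static double MeasureHitRate(int cacheSize, int n, double s, int sampleCount)
    {
        var random = new Random(42);
        var lru = new ConcurrentLru<int, int>(cacheSize);
        int hits = 0;

        for (int i = 0; i < sampleCount; i++)
        {
            int key = NextKey(random, n, s);

            if (lru.TryGet(key, out _))
            {
                hits++;
            }
            else
            {
                lru.GetOrAdd(key, k => k);
            }
        }

        return (double)hits / sampleCount;
    }
}
```

For example, MeasureHitRate(5000, 50000, 0.86, 1_000_000) corresponds to the 10% cache size data point described above.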
In these benchmarks, a cache miss is essentially free. These tests exist purely to compare the raw execution speed of the cache bookkeeping code. In a real setting, where a cache miss is presumably quite expensive, the relative overhead of the cache will be very small.
Benchmarks are based on BenchmarkDotNet, so are single threaded. The ConcurrentLru family of classes is composed internally of ConcurrentDictionary.GetOrAdd and ConcurrentQueue.Enqueue/Dequeue method calls, and scales well to concurrent workloads.
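To give a feel for why that composition scales, the sketch below shows a greatly simplified bounded cache built from the same two primitives. It is illustrative only and is not the ConcurrentLru implementation, which uses its own queue structure and bookkeeping to approximate LRU order.

```csharp
using System;
using System.Collections.Concurrent;

// Simplified illustration only (not the BitFaster.Caching implementation):
// a bounded cache composed from a ConcurrentDictionary for lookups and a
// ConcurrentQueue that records insertion order for eviction.
public class BoundedCacheSketch<K, V>
{
    private readonly int capacity;
    private readonly ConcurrentDictionary<K, V> dictionary = new ConcurrentDictionary<K, V>();
    private readonly ConcurrentQueue<K> queue = new ConcurrentQueue<K>();

    public BoundedCacheSketch(int capacity)
    {
        this.capacity = capacity;
    }

    public V GetOrAdd(K key, Func<K, V> valueFactory)
    {
        if (this.dictionary.TryGetValue(key, out var existing))
        {
            return existing;
        }

        V value = this.dictionary.GetOrAdd(key, valueFactory);
        this.queue.Enqueue(key);

        // Evict in FIFO order once over capacity. No exclusive lock is ever
        // held, which is why this style of bookkeeping scales under
        // concurrent load.
        while (this.dictionary.Count > this.capacity && this.queue.TryDequeue(out var evicted))
        {
            this.dictionary.TryRemove(evicted, out _);
        }

        return value;
    }
}
```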
All benchmarks below are run on the same computer:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.264 (2004/?/20H1)
Intel Core i7-5600U CPU 2.60GHz (Broadwell), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.1.301
[Host] : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
RyuJitX64 : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
Job=RyuJitX64 Jit=RyuJit Platform=X64
FastConcurrentLru and FastConcurrentTLru are classes that execute with the hit counting logic eliminated (via the JIT). If hit counts are not required, this makes the code around 10% faster.
Take 1000 samples of a Zipfian distribution over a set of keys of size N and use the keys to look up values in the cache. If there are N items, the probability of accessing an item numbered i or less is (i / N)^s.
s = 0.86 (yields approx 80/20 distribution)
N = 500
Cache size = N / 10 (so we can cache 10% of the total set). ConcurrentLru has approximately the same computational overhead as a standard LRU in this single threaded test.
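A benchmark of this shape might be structured as in the sketch below (hypothetical; it assumes the ClassicLru class from the results table takes a capacity and exposes the same GetOrAdd API, and it is not the repository's benchmark code). Results for the actual benchmarks follow in the table below.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BitFaster.Caching.Lru;

// Hypothetical sketch of the single threaded Zipfian lookup benchmark
// described above; not the repository's benchmark code.
public class ZipfLookupBenchmark
{
    private const int N = 500;
    private const int SampleCount = 1000;
    private const double S = 0.86;

    private readonly int[] samples = new int[SampleCount];
    private readonly ClassicLru<int, int> classicLru = new ClassicLru<int, int>(N / 10);
    private readonly ConcurrentLru<int, int> concurrentLru = new ConcurrentLru<int, int>(N / 10);

    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);

        for (int i = 0; i < SampleCount; i++)
        {
            // Inverse-transform Zipfian sample (see the sampler sketch above).
            samples[i] = (int)Math.Ceiling(N * Math.Pow(1.0 - random.NextDouble(), 1.0 / S));
        }
    }

    [Benchmark(Baseline = true)]
    public void ClassicLruLookup()
    {
        foreach (var key in samples) { classicLru.GetOrAdd(key, k => k); }
    }

    [Benchmark]
    public void ConcurrentLruLookup()
    {
        foreach (var key in samples) { concurrentLru.GetOrAdd(key, k => k); }
    }
}
```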
Method | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|
ClassicLru | 175.7 ns | 2.75 ns | 2.43 ns | 1.00 | 0.00 |
FastConcurrentLru | 180.2 ns | 2.55 ns | 2.26 ns | 1.03 | 0.02 |
ConcurrentLru | 189.1 ns | 3.14 ns | 2.94 ns | 1.08 | 0.03 |
FastConcurrentTLru | 261.4 ns | 4.53 ns | 4.01 ns | 1.49 | 0.04 |
ConcurrentTLru | 266.1 ns | 3.96 ns | 3.51 ns | 1.51 | 0.03 |
In this test the same items are fetched repeatedly and no items are evicted. This is representative of a high hit rate scenario, where there are a small number of hot items.
- The ConcurrentLru family does not move items in the queues; it just marks items as accessed for pure cache hits.
- ClassicLru must maintain item order, and internally splices the fetched item to the head of the linked list.
- MemoryCache and ConcurrentDictionary represent a pure lookup. This is the best case scenario for MemoryCache, since the lookup key is a string (if the key were a Guid, using MemoryCache adds string conversion overhead).
FastConcurrentLru does not allocate and is approximately 10x faster than System.Runtime.Caching.MemoryCache or the newer Microsoft.Extensions.Caching.Memory.MemoryCache.
Method | Mean | Error | StdDev | Ratio | Gen 0 | Allocated |
---|---|---|---|---|---|---|
ConcurrentDictionary | 16.76 ns | 0.322 ns | 0.285 ns | 1.00 | - | - |
FastConcurrentLru | 18.94 ns | 0.249 ns | 0.220 ns | 1.13 | - | - |
ConcurrentLru | 21.46 ns | 0.204 ns | 0.191 ns | 1.28 | - | - |
FastConcurrentTLru | 41.57 ns | 0.450 ns | 0.376 ns | 2.48 | - | - |
ConcurrentTLru | 43.95 ns | 0.588 ns | 0.521 ns | 2.62 | - | - |
ClassicLru | 67.62 ns | 0.901 ns | 0.799 ns | 4.03 | - | - |
RuntimeMemoryCacheGet | 279.70 ns | 3.825 ns | 3.578 ns | 16.70 | 0.0153 | 32 B |
ExtensionsMemoryCacheGet | 341.67 ns | 6.617 ns | 6.499 ns | 20.35 | 0.0114 | 24 B |
In this test, we generate 2000 samples of 500 keys with a Zipfian distribution (s = 0.86). Caches have size 50. From N concurrent threads, fetch the sample keys in sequence (each thread is using the same input keys). The principal scalability limit in concurrent applications is the exclusive resource lock. As the number of threads increases, ConcurrentLru significantly outperforms an LRU implemented with a short lived exclusive lock used to synchronize the linked list data structure.
This test was run on a Standard D16s v3 Azure VM (16 cpus), with .NET Core 3.1.
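A hypothetical sketch of this kind of throughput measurement is shown below (it is not the repository's benchmark code; the sample count, key count, cache size and distribution parameter follow the description above):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using BitFaster.Caching.Lru;

// Hypothetical sketch of the concurrent throughput test described above;
// not the repository's benchmark code.
public static class ThroughputTest
{
    public static double MeasureOpsPerSecond(int threadCount)
    {
        const int sampleCount = 2000;
        const int keyCount = 500;
        const double s = 0.86;

        var random = new Random(42);
        var samples = new int[sampleCount];

        for (int i = 0; i < sampleCount; i++)
        {
            // Inverse-transform Zipfian sample (see the sampler sketch above).
            samples[i] = (int)Math.Ceiling(keyCount * Math.Pow(1.0 - random.NextDouble(), 1.0 / s));
        }

        var lru = new ConcurrentLru<int, int>(50);
        var stopwatch = Stopwatch.StartNew();

        // Each thread fetches the same sample keys in sequence.
        var tasks = new Task[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                foreach (var key in samples)
                {
                    lru.GetOrAdd(key, k => k);
                }
            });
        }

        Task.WaitAll(tasks);
        stopwatch.Stop();

        return threadCount * sampleCount / stopwatch.Elapsed.TotalSeconds;
    }
}
```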
TemplateConcurrentLru features injectable behaviors defined as structs. Structs are subject to special JIT optimizations: the .NET JIT compiler can inline, eliminate dead code and propagate JIT time constants based on structs. Using this technique, TemplateConcurrentLru can be customized to support LRU and TLRU policies without compromising execution speed.
This is the source code for the TryGet method. It calls into two value type generic type arguments: policy (1) and hitcounter (2).
public bool TryGet(K key, out V value)
{
I item;
if (dictionary.TryGetValue(key, out item))
{
return GetOrDiscard(item, out value);
}
value = default(V);
this.hitCounter.IncrementMiss();
return false;
}
// AggressiveInlining forces the JIT to inline policy.ShouldDiscard(). For the LRU policy
// the first branch is completely eliminated due to JIT time constant propagation.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private bool GetOrDiscard(I item, out V value)
{
if (this.policy.ShouldDiscard(item)) // 1
{
this.Move(item, ItemDestination.Remove);
this.hitCounter.IncrementMiss(); // 2
value = default(V);
return false;
}
value = item.Value;
this.policy.Touch(item); // 1
this.hitCounter.IncrementHit(); // 2
return true;
}
The LruPolicy used by FastConcurrentLru/ConcurrentLru is hardcoded to never discard items.
public readonly struct LruPolicy<K, V> : IPolicy<K, V, LruItem<K, V>>
{
...
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool ShouldDiscard(LruItem<K, V> item)
{
return false;
}
...
}
The branch and enclosed code for ShouldDiscard are completely eliminated by JIT (1). Since the below code is from FastConcurrentLru, the hit counting calls (2) are also eliminated. The assembly code for the FastConcurrentLru.TryGet method with LruPolicy and NullHitCounter is 76 bytes:
; BitFaster.Caching.Lru.TemplateConcurrentLru`5[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[BitFaster.Caching.Lru.LruPolicy`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]], BitFaster.Caching],[BitFaster.Caching.Lru.NullHitCounter, BitFaster.Caching]].TryGet(Int32, Int32 ByRef)
push rsi
sub rsp,30
xor eax,eax
mov [rsp+28],rax
mov rsi,r8
mov rcx,[rcx+8]
lea r8,[rsp+28]
cmp [rcx],ecx
call qword ptr [7FFED15190A0]
test eax,eax
je short M01_L00
mov rax,[rsp+28]
mov eax,[rax+0C]
mov [rsi],eax
mov rax,[rsp+28]
mov byte ptr [rax+10],1
mov eax,1
add rsp,30
pop rsi
ret
M01_L00:
xor eax,eax
mov [rsi],eax
add rsp,30
pop rsi
ret
; Total bytes of code 76
The struct TLruLongTicksPolicy used in FastConcurrentTLru can expire items.
public readonly struct TLruLongTicksPolicy<K, V> : IPolicy<K, V, LongTickCountLruItem<K, V>>
{
...
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool ShouldDiscard(LongTickCountLruItem<K, V> item)
{
if (Stopwatch.GetTimestamp() - item.TickCount > this.timeToLive)
{
return true;
}
return false;
}
...
}
As a result, the JITted code now includes branch 1, and the assembly code grows considerably to 312 bytes.
; BitFaster.Caching.Lru.TemplateConcurrentLru`5[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[BitFaster.Caching.Lru.TLruLongTicksPolicy`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]], BitFaster.Caching],[BitFaster.Caching.Lru.NullHitCounter, BitFaster.Caching]].TryGet(Int32, Int32 ByRef)
push rbp
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbx
sub rsp,88
lea rbp,[rsp+0C0]
xor ebx,ebx
mov [rbp+0FFC0],rbx
mov [rbp+0FFB8],rbx
mov [rbp+20],r8
mov rsi,rcx
mov ebx,edx
lea rcx,[rbp+0FF70]
mov rdx,r10
call CORINFO_HELP_INIT_PINVOKE_FRAME
mov r14,rax
mov rcx,rsp
mov [rbp+0FF90],rcx
mov rcx,rbp
mov [rbp+0FFA0],rcx
mov rcx,[rsi+8]
lea r8,[rbp+0FFC0]
mov edx,ebx
cmp [rcx],ecx
call qword ptr [7FFED15290A0]
test eax,eax
je near ptr M01_L03
mov [rbp+10],rsi
mov rbx,[rsi+40]
mov r15,[rbp+0FFC0]
mov [rbp+0FFB0],r15
lea rcx,[rbp+0FFB8]
xor r11d,r11d
mov rax,offset MD_Interop+Kernel32.QueryPerformanceCounter(Int64*)
mov [rbp+0FF80],rax
lea rax,[M01_L00]
mov [rbp+0FF98],rax
lea rax,[rbp+0FF70]
mov [r14+10],rax
mov byte ptr [r14+0C],0
call qword ptr [7FFED151D7D0]
M01_L00:
mov byte ptr [r14+0C],1
cmp dword ptr [7FFED1524BD8],0
je short M01_L01
call qword ptr [7FFED1528278]
M01_L01:
mov rcx,[rbp+0FF78]
mov [r14+10],rcx
mov rcx,[rbp+0FFB8]
mov r15,[rbp+0FFB0]
sub rcx,[r15+18]
cmp rcx,rbx
jle short M01_L02
mov rcx,[rbp+10]
mov rdx,[rbp+0FFC0]
mov r8d,2
call BitFaster.Caching.Lru.TemplateConcurrentLru`5[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib],[BitFaster.Caching.Lru.TLruLongTicksPolicy`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]], BitFaster.Caching],[BitFaster.Caching.Lru.NullHitCounter, BitFaster.Caching]].Move(System.__Canon, BitFaster.Caching.Lru.ItemDestination)
xor eax,eax
mov rdi,[rbp+20]
mov [rdi],eax
jmp short M01_L04
M01_L02:
mov rax,[rbp+0FFC0]
mov eax,[rax+0C]
mov rdi,[rbp+20]
mov [rdi],eax
mov rax,[rbp+0FFC0]
mov byte ptr [rax+10],1
mov eax,1
jmp short M01_L04
M01_L03:
xor eax,eax
mov rdi,[rbp+20]
mov [rdi],eax
M01_L04:
movzx eax,al
mov byte ptr [r14+0C],1
lea rsp,[rbp+0FFC8]
pop rbx
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
; Total bytes of code 312