.NET implementation of https://github.com/gunnarmorling/1brc
If you want to compare results on the same hardware, open a PR with your implementation. Add it to a directory
1brc_githugusername
. This directory must contain a dotnet project1brc.csproj
(not renamed) and your code. You program must accept the first argument as the path to the measurements file.This should work from the repo dir: dotnet build 1brc_githugusername/1brc.csproj -c Release dotnet run --project 1brc_githugusername/1brc.csproj -c Release --no-build -- path/to/the/file.txt
Note that his implementation supports \r\n
line endings. The numbers in the Evolution section below are from Windows, where the input file is 13.7GB vs 12.8GB. It's generated by the upstream Java code on Windows so this is an implicit requirement for x-plat already.
The performance is measured on an idle 6C/12T Alder Lake CPU fixed at 2.5 GHz (no turbo), 32GB DDR4 3200, Debian 12 in LXC. At least 5 runs, showing the best result.
Note: results are very stable, usually only the 2nd decimal changes between the runs.
.NET
№ | JIT | AOT | Implementation | Runtime | Submitter |
---|---|---|---|---|---|
1. | 00:03.971 | 00:03.725 | THIS REPO | linux-x64 | Victor Baybekov |
2. | 00:05.979 | 00:06.657 | link | linux-x64 | Pedro Travi |
3. | 00:08.079 | 00:08.589 | link | linux-x64 | Fabien Barbier |
For AOT added this properties and
dotnet publish -r linux-x64 -c Release
<PublishAot>true</PublishAot> <OptimizationPreference>Speed</OptimizationPreference> <IlcInstructionSet>native</IlcInstructionSet> <PublishReadyToRun>true</PublishReadyToRun>
Interestingly AOT is beneficial for my code but detrimental for the other two versions.
Java (as of Jan 7 14:30 PM UTC)
№ | JIT | AOT | Implementation | Runtime | Submitter |
---|---|---|---|---|---|
1.* | 00:04.097 | ✖️ | link | 21.0.1-graal | Roy van Rijn |
1.* | 00:04.128 | ✖️ | link | 21.0.1-open | Quan Anh Mai |
3. | 00:06.344 | ✖️ | link | 21.0.1-graal | Elliot Barlas |
~ | 03:28.764 | ✖️ | link (baseline) | 21.0.1-open | Gunnar Morling |
* On my machine 1 and 2 places are swapped for Java. The best number is shown in the table instead of an average. However the individual results overlap a lot in a tight range. Therefore I believe it's a tie.
Below is the evolution of results with each commit. The time shown here is measured inside the app, on Windows. Not comparable with time
command (startup/shutdown time is not measured).
Mmap + paralell using Span API and some unsafe tricks to avoid Utf8 parsing until the very end.
Processed in 00:00:10.6978618
Processed in 00:00:10.8473143
Processed in 00:00:10.9107262
Processed in 00:00:10.9733218
Processed in 00:00:10.5854176
Processed in 00:00:09.7093471
Float parsing is ~57%, dictionary lookup is ~24%. Optimizing further is about those two things. We may use csFastFloat
library and a specialized dictionary such as DictionarySlim
. However the goal is to avoid dependencies even if they are pure .NET.
It's near-perfectly parallelizable though. On 8 cores it should be 33% faster than on 6 that I have. With 32GB RAM the file should be cached by an OS after the first read. The first read may be very slow in the cloud VM, but then the cache should eliminate the difference between drive speeds.
If we can assume that the float values are well formed then the speed almost doubles.
Processed in 00:00:05.5519479
No assumptions are required if we fallback to the full .NET parsing implementation on any irregularity.
Processed in 00:00:05.2944041
Processed in 00:00:05.3489315
Processed in 00:00:04.7363095
Processed in 00:00:04.8472097
Processed in 00:00:04.8235814
Processed in 00:00:04.7163938
Processed in 00:00:04.4547973
Processed in 00:00:04.5303938
Processed in 00:00:04.5125394
See comments in Utf8Span.GetHashCode
Processed in 00:00:04.2237865
Processed in 00:00:04.2524434
Processed in 00:00:04.2688423
Processed in 00:00:03.9916535 Processed in 00:00:03.9897462 Processed in 00:00:03.9810353
Set dictionary capacity to 10k.
`\n` only
Processed in 00:00:03.2530659
Processed in 00:00:03.1561451
`\r\n`
Processed in 00:00:03.3463769
Processed in 00:00:03.3641962
Processed in 00:00:03.3762491
Use AVX2-optimized IndexOf
.
Find boundaries for ;
, .
and \n
.
Optimize/simplify int parsing between ;
and .
and use a single digit that is always after .
.
Processed in 00:00:03.0687066
Processed in 00:00:03.0819551
Struct size is 16 👌
Avoid possible zero extension overhead in ParseInt and Hash, optimize ParseInt locals & loop
Extending short => int
is more expensive than short => ushort => uint => int
if all numbers are positive.
ParseInt loop was not good
The initial implementation was lazy with more operations than needed
This includes 3 last changes
Processed in 00:00:03.0355594
Processed in 00:00:02.9863322
Processed in 00:00:03.0102503