TinyCsvParser Benchmark
Closed this issue · 2 comments
Hi @joelverhagen!
Thanks for this post! Your results for TinyCsvParser are correct. The Line Tokenizer I implemented is inefficient, and I knew about this. Very cool that it's indeed the slowest implementation in your benchmarks. 🥇
I think it does way too many allocations and takes a weird approach. I really should have paid way more attention in lectures on Finite State Machines and Compilers!
So if you feel like it, you are (very, very) welcome to replace my implementation with your version and make a Pull Request to TinyCsvParser. You just need to plug it in here (the tokenizer may be shared between threads, so better not share state):
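For illustration, a minimal sketch of that extension point, assuming only the ITokenizer shape used in the test code below (the Split call is a placeholder and not quote-aware):

public class MyFastTokenizer : ITokenizer
{
    public string[] Tokenize(string input)
    {
        // Keep all state in locals: one tokenizer instance may be called
        // from several PLINQ worker threads at the same time.
        return input.Split(',');
    }
}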
So don't take any of this as some sort of criticism; your results are correct.
If this is so slow, what's the use case for the library then?
You are saying you are reading large files, which means you probably have dozens of cores idling while the file is parsed sequentially. How long does a modern SSD take to read such a small file with a million lines? Maybe a hundred milliseconds? Now put some object mapping, conversions, and validation on top of the other parsers and you'll see the overhead.
At some point you'll notice: reading the file and tokenizing it isn't the bottleneck anymore.
TinyCsvParser uses PLINQ in its pipeline, so it's trivial to parallelize the whole thing and do the result mapping and whatever you want with the data in parallel. This of course has one severe drawback: You cannot parse multi-line CSV data. But if you are OK with that, cool.
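To illustrate the general pattern (a hedged sketch only, not TinyCsvParser's actual internals; MapToModel stands in for the mapping step):

// Lines are read sequentially, then tokenized and mapped in parallel with PLINQ.
var models = File.ReadLines(filename)
    .AsParallel()
    .AsOrdered()                        // keep the source order, if required
    .Select(line => tokenizer.Tokenize(line))
    .Select(tokens => MapToModel(tokens))
    .ToList();

This works because each line is an independent unit of work once line splitting is done, which is exactly why embedded newlines in quoted fields break the scheme.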
And what's important here: You can easily switch the Tokenizer to a different implementation.
Switching the Line Tokenizer to a string.Split implementation yields something around 4400 ms on your test data, for tokenizing the lines, converting some properties to a DateTime, and doing the object mapping. Is this unfair, because I am utilizing X cores and the other parsers don't? Yes! Could you do this with the other parsers? Maybe definitely yes!
The StringSplitTokenizer yields:
Parsed 1000000 valid lines ...
[Reading G:\Github\TinyCsvParser\TinyCsvParser\TinyCsvParser.Test\bin\Debug\net452\test_file.txt ...] Elapsed Time = 4475.574 Milliseconds
Using the custom, RFC 4180-inspired CustomTokenizer yields:
Parsed 1000000 valid lines ...
[Reading G:\Github\TinyCsvParser\TinyCsvParser\TinyCsvParser.Test\bin\Debug\net452\test_file.txt ...] Elapsed Time = 4849.2588 Milliseconds
Here is the full test:
using NUnit.Framework;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using TinyCsvParser.Mapping;
using TinyCsvParser.Tokenizer;
using TinyCsvParser.Tokenizer.RFC4180;

namespace TinyCsvParser.Test.Integration
{
    [TestFixture]
    [Explicit("https://github.com/joelverhagen/NCsvPerf/issues/2")]
    public class NCsvPerfBenchmark
    {
        private class CustomTokenizer : ITokenizer
        {
            public string[] Tokenize(string input)
            {
                var result = new List<string>();
                var str = new StringBuilder();
                bool isInQuotes = false;

                // Iterate the string directly; ToCharArray() would only add an allocation.
                foreach (var c in input)
                {
                    if (c == '"')
                    {
                        // Toggle the quoted state; quote characters are not emitted.
                        isInQuotes = !isInQuotes;
                    }
                    else if (c == ',' && !isInQuotes)
                    {
                        // An unquoted comma ends the current field.
                        result.Add(str.ToString());
                        str.Clear();
                    }
                    else
                    {
                        str.Append(c);
                    }
                }

                // Flush the final field.
                result.Add(str.ToString());

                return result.ToArray();
            }
        }
        private class TestModel
        {
            public string Id { get; set; }
            public DateTime LastCrawled { get; set; }
            public string Project { get; set; }
            public string ProjectVersion { get; set; }
            public DateTime LastUpdate { get; set; }
            public string Assets { get; set; }
            public string RuntimeAssemblies { get; set; }
            public string Placeholder1 { get; set; }
            public string Platform { get; set; }
            public string Runtime { get; set; }
            public string Placeholder2 { get; set; }
            public string Placeholder3 { get; set; }
            public string Placeholder4 { get; set; }
            public string Placeholder5 { get; set; }
            public string Placeholder6 { get; set; }
            public string Placeholder7 { get; set; }
            public string Placeholder8 { get; set; }
            public string Filename1 { get; set; }
            public string Filename2 { get; set; }
            public string Extension { get; set; }
            public string Type { get; set; }
            public string Target1 { get; set; }
            public string Target2 { get; set; }
            public string RuntimeVersion { get; set; }
            public string Version { get; set; }
        }
        private class TestModelMapping : CsvMapping<TestModel>
        {
            public TestModelMapping()
            {
                MapProperty(0, x => x.Id);
                MapProperty(1, x => x.LastCrawled);
                MapProperty(2, x => x.Project);
                MapProperty(3, x => x.ProjectVersion);
                MapProperty(4, x => x.LastUpdate);
                MapProperty(5, x => x.Assets);
                MapProperty(6, x => x.RuntimeAssemblies);
                MapProperty(7, x => x.Placeholder1);
                MapProperty(8, x => x.Platform);
                MapProperty(9, x => x.Runtime);
                MapProperty(10, x => x.Placeholder2);
                MapProperty(11, x => x.Placeholder3);
                MapProperty(12, x => x.Placeholder4);
                MapProperty(13, x => x.Placeholder5);
                MapProperty(14, x => x.Placeholder6);
                MapProperty(15, x => x.Filename1);
                MapProperty(16, x => x.Filename2);
                MapProperty(17, x => x.Extension);
                MapProperty(18, x => x.Type);
                MapProperty(19, x => x.Target1);
                MapProperty(20, x => x.Target2);
                MapProperty(21, x => x.RuntimeVersion);
                MapProperty(22, x => x.Placeholder7);
                MapProperty(23, x => x.Placeholder8);
                MapProperty(24, x => x.Version);
            }
        }
        [Test]
        public void RunStringSplitTokenizerTest()
        {
            // The built-in StringSplitTokenizer splits each line on ',' without any quote handling.
            var options = new CsvParserOptions(false, new StringSplitTokenizer(new[] { ',' }, false));
            var mapping = new TestModelMapping();
            var parser = new CsvParser<TestModel>(options, mapping);

            string filename = GetTestFilePath();

            MeasurementUtils.MeasureElapsedTime(
                description: $"Reading {filename} ...",
                action: () =>
                {
                    var cnt = parser
                        .ReadFromFile(filename, Encoding.UTF8)
                        .Where(x => x.IsValid)
                        .Count();

                    TestContext.WriteLine($"Parsed {cnt} valid lines ...");
                },
                timespanFormatter: x => $"{x.TotalMilliseconds} Milliseconds");
        }
        [Test]
        public void RunCustomTokenizerTest()
        {
            var options = new CsvParserOptions(false, new CustomTokenizer());
            var mapping = new TestModelMapping();
            var parser = new CsvParser<TestModel>(options, mapping);

            string filename = GetTestFilePath();

            MeasurementUtils.MeasureElapsedTime(
                description: $"Reading {filename} ...",
                action: () =>
                {
                    var cnt = parser
                        .ReadFromFile(filename, Encoding.UTF8)
                        .Where(x => x.IsValid)
                        .Count();

                    TestContext.WriteLine($"Parsed {cnt} valid lines ...");
                },
                timespanFormatter: x => $"{x.TotalMilliseconds} Milliseconds");
        }
        [Test]
        public void RunRfc4180TokenizerTest()
        {
            // RFC4180Tokenizer options: quote character, escape character, delimiter.
            var options = new CsvParserOptions(false, new RFC4180Tokenizer(new Options('"', '\\', ',')));
            var mapping = new TestModelMapping();
            var parser = new CsvParser<TestModel>(options, mapping);

            string filename = GetTestFilePath();

            MeasurementUtils.MeasureElapsedTime(
                description: $"Reading {filename} ...",
                action: () =>
                {
                    var cnt = parser
                        .ReadFromFile(filename, Encoding.UTF8)
                        .Where(x => x.IsValid)
                        .Count();

                    TestContext.WriteLine($"Parsed {cnt} valid lines ...");
                },
                timespanFormatter: x => $"{x.TotalMilliseconds} Milliseconds");
        }
        [SetUp]
        public void SetUp()
        {
            // Generate a test file with 1,000,000 identical lines next to the test assembly.
            StringBuilder stringBuilder = new StringBuilder();

            for (int i = 0; i < 1_000_000; i++)
            {
                stringBuilder.AppendLine("75fcf875-017d-4579-bfd9-791d3e6767f0,2020-11-28T01:50:41.2449947+00:00,Akinzekeel.BlazorGrid,0.9.1-preview,2020-11-27T22:42:54.3100000+00:00,AvailableAssets,ResourceAssemblies,,,net5.0,,,,,,lib/net5.0/de/BlazorGrid.resources.dll,BlazorGrid.resources.dll,.dll,lib,net5.0,.NETCoreApp,5.0.0.0,,,0.0.0.0");
            }

            var testFilePath = GetTestFilePath();

            File.WriteAllText(testFilePath, stringBuilder.ToString(), Encoding.UTF8);
        }

        [TearDown]
        public void TearDown()
        {
            var testFilePath = GetTestFilePath();

            File.Delete(testFilePath);
        }

        private string GetTestFilePath()
        {
#if NETCOREAPP1_1
            var basePath = AppContext.BaseDirectory;
#else
            var basePath = AppDomain.CurrentDomain.BaseDirectory;
#endif
            return Path.Combine(basePath, "test_file.txt");
        }
    }
}
And the MeasurementUtils:
using NUnit.Framework;
using System;
using System.Diagnostics;

namespace TinyCsvParser.Test.Integration
{
    public static class MeasurementUtils
    {
        public static void MeasureElapsedTime(string description, Action action)
        {
            // Get the elapsed time as a TimeSpan value.
            TimeSpan ts = MeasureElapsedTime(action);

            // Format and display the TimeSpan value.
            string elapsedTime = string.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts.Hours, ts.Minutes, ts.Seconds,
                ts.Milliseconds / 10);

            TestContext.WriteLine("[{0}] Elapsed Time = {1}", description, elapsedTime);
        }

        public static void MeasureElapsedTime(string description, Action action, Func<TimeSpan, string> timespanFormatter)
        {
            // Get the elapsed time as a TimeSpan value.
            TimeSpan ts = MeasureElapsedTime(action);

            string elapsedTime = timespanFormatter(ts);

            TestContext.WriteLine("[{0}] Elapsed Time = {1}", description, elapsedTime);
        }

        private static TimeSpan MeasureElapsedTime(Action action)
        {
            Stopwatch stopWatch = new Stopwatch();

            stopWatch.Start();

            action();

            stopWatch.Stop();

            return stopWatch.Elapsed;
        }
    }
}
Wow, awesome approach! Just to make sure I understand correctly:
- A TextReader/File.ReadLines/string is split into lines (one string per line in source data)
- Rows are both tokenized and mapped to the output object in parallel using PLINQ
In other words, the only non-parallel work is splitting a block of bytes/string into lines prior to any CSV-specific work. The rest (the real "CSV parsing" step) is parallel.
Very cool! Nice work making your library support this approach out of the box!
You cannot parse multi-line CSV data.
I think this could be mitigated by implementing a lightweight, custom line splitter that is aware of multi-line CSV. Perhaps this could be done in a way that isn't much more expensive than however TextReader.ReadLine() is implemented.
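For example, such a quote-aware record reader could look something like this (hypothetical code, and deliberately naive: it toggles on every quote character and ignores escaped quotes):

public static IEnumerable<string> ReadCsvRecords(TextReader reader)
{
    var sb = new StringBuilder();
    bool inQuotes = false;
    int c;

    while ((c = reader.Read()) != -1)
    {
        char ch = (char)c;

        if (ch == '"')
        {
            // Track whether we are inside a quoted field, so that a
            // newline inside quotes does not terminate the record.
            inQuotes = !inQuotes;
        }

        if (ch == '\n' && !inQuotes)
        {
            // End of record: strip the '\r' of a CRLF line ending.
            if (sb.Length > 0 && sb[sb.Length - 1] == '\r')
            {
                sb.Length--;
            }

            yield return sb.ToString();
            sb.Clear();
        }
        else
        {
            sb.Append(ch);
        }
    }

    if (sb.Length > 0)
    {
        yield return sb.ToString();
    }
}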
Yes! Could you do this with the other parsers? Maybe definitely yes!
I think this approach would most trivially be applied to parsers that don't maintain their own buffer and already operate line by line. If the parser takes bytes and emits a string[] per line all in one function, it would be harder to apply the "serial line splitting + parallel CSV parsing" approach.
In conclusion, thanks a bunch for bringing this to my attention. I think my blog post was a bit hand-wavy about the pipeline that takes bytes and emits a materialized list of record objects. Perhaps a better wording would have focused solely on the CSV tokenization/parsing step. That was my intent, since it allows clever parallelization, activation, mapping, etc., all built on top of a fast tokenization routine.
@joelverhagen Yes, I have never benchmarked whether it really makes sense to do the tokenizing in parallel. If not, one could indeed make the Tokenizer multi-line aware and do the mapping in parallel from there on.
We can close this issue, because there is no issue. ✌️