tspence/csharp-csv-reader

Have you compared parsing to CsvHelper?

proca-ainq opened this issue · 2 comments

I'd be interested to see what kinds of cases your library can handle that the popular CsvHelper doesn't. I've had trouble finding a bulletproof CSV parser for C# of the quality that you find for Java.

Thanks for the feedback! I wrote this code almost two decades ago when I was first learning C# / DotNet, and at the time I wasn't aware of other CSV helper libraries. I chose to open source it circa 2012 so that I could experiment with building and publishing NuGet packages. Along the way I ran into some weird CSV formatting problems, and I made sure this library could handle them.

Here are a few of the parsing problems I encountered that I designed this library to handle:

  1. Some CSV files included embedded newlines, which many CSV parsers would treat as two separate lines rather than one line with an embedded newline.
  2. Some encoders chose to use different delimiters (commas, tabs, pipe symbols) and some encoders used unusual text qualifiers (double quotes, single quotes, etc.). I chose to make both the delimiter and the text qualifier programmable.
  3. A handful of CSV files included spaces after the comma and before the text qualifier. One file would have "a","b","c" and another file might have "a", "b", "c".
  4. Some CSV parsers that I found were generic "table format parsers", and they would also parse XLS files and things like that. Those parsers turned out to be huge and complicated, and they had lots of dependencies.
  5. I worked with some legacy code that was stuck on DotNet 2 and couldn't be upgraded for whatever reason. I chose to spend time to make this library backwards compatible so that I could include it in whatever project I worked on, no matter how old. That said, I've never had anyone ask me for support for DotNet 1. ;)
  6. Overall, the biggest problem I faced was that most CSV parsers decoded everything into memory at once. I rewrote this parser so that I could handle terabyte-sized files that I received from video game companies analyzing events, streaming them off disk one row at a time instead of parsing the entire file into memory.
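
To make items 1, 2, 3, and 6 concrete, here is a minimal sketch of a streaming parser that copes with those cases. It is not this library's actual code or API: the `CsvSketch` class and `ReadRows` method are hypothetical names made up for illustration, and the sketch assumes that a doubled qualifier inside a qualified field stands for one literal qualifier and that blank lines are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical illustration only; not the library's real API.
public static class CsvSketch
{
    // Lazily yields one row at a time, so only the current row is held in memory (item 6).
    // The delimiter and text qualifier are parameters (item 2).
    public static IEnumerable<string[]> ReadRows(
        TextReader reader, char delimiter = ',', char qualifier = '"')
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;   // currently inside a qualified field
        bool wasQuoted = false;  // current field was qualified and has been closed
        bool any = false;        // something has been read for the current row
        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            if (inQuotes)
            {
                if (ch != qualifier)
                {
                    field.Append(ch);        // embedded newlines land here (item 1)
                }
                else if (reader.Peek() == qualifier)
                {
                    reader.Read();           // doubled qualifier: keep one literal copy
                    field.Append(qualifier);
                }
                else
                {
                    inQuotes = false;
                    wasQuoted = true;
                }
                continue;
            }
            if (ch == qualifier && !wasQuoted && IsBlank(field))
            {
                field.Clear();               // drop spaces between delimiter and qualifier (item 3)
                inQuotes = true;
                any = true;
            }
            else if (ch == delimiter)
            {
                fields.Add(field.ToString());
                field.Clear();
                wasQuoted = false;
                any = true;
            }
            else if (ch == '\n' || ch == '\r')
            {
                if (ch == '\r' && reader.Peek() == '\n') reader.Read(); // treat CRLF as one break
                if (any)
                {
                    fields.Add(field.ToString());
                    yield return fields.ToArray();
                }
                fields.Clear();
                field.Clear();
                wasQuoted = false;
                any = false;
            }
            else if (!wasQuoted)
            {
                field.Append(ch);
                any = true;
            }
            // Characters between a closing qualifier and the next delimiter are ignored.
        }
        if (any)
        {
            fields.Add(field.ToString());
            yield return fields.ToArray();   // final row without a trailing newline
        }
    }

    // True when the buffer is empty or holds only spaces.
    private static bool IsBlank(StringBuilder sb)
    {
        for (int i = 0; i < sb.Length; i++)
        {
            if (sb[i] != ' ') return false;
        }
        return true;
    }
}
```

Because `ReadRows` is an iterator, wrapping a `StreamReader` in a `foreach` over `CsvSketch.ReadRows(reader)` pulls rows off disk on demand, and passing `'|'` and `'\''` as the second and third arguments switches to pipe-delimited, single-quoted input.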

After creating this, I also decided to tinker with automated testing. If you're curious, you can examine some of the test cases in this project to see how they are handled.

Hope this helps!

Let me know if you run into any other questions :)