extend benchmarks

Question

matu3ba opened this issue 3 years ago · 4 comments

Please be explicit on 1. allocation, 2. constraints, 3. usage of SIMD

I would be curious how fast other SIMD-accelerated parsers are in comparison. For example, there is https://github.com/geofflangdale/simdcsv

Answer 1 · 2022-02-25T14:54:50.000Z

The simdcsv library you mentioned was the first one we looked at when we were hoping to use an existing library instead of writing our own. However, it appears that it does not handle "real-world" CSV parsing, as its authors state here:

Real-life parsing of CSV files has to deal with a huge range of optional variations on what a CSV might look like. My plan is to initially focus on standards-compliant CSV files and potentially add some variations later

Handling the edge cases while maintaining performance is perhaps the single most challenging part of designing this library, so a comparison against a parser that ignores the edge cases is of limited value.
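To make the edge-case point concrete, here is a minimal sketch of how a row count goes wrong when quoting is ignored. It uses Python's standard `csv` module purely for illustration and is not this library's code or benchmark; the sample data and the function names `naive_count` and `quote_aware_count` are hypothetical.

```python
import csv
import io

# Hypothetical sample: the last record has one quoted field that contains
# a comma, an escaped double quote, and an embedded newline.
SAMPLE = 'id,comment\r\n1,plain\r\n2,"has, comma and ""quote""\nand a newline"\r\n'


def naive_count(text: str) -> int:
    """Count rows by counting line breaks, ignoring quoting entirely."""
    return len(text.splitlines())


def quote_aware_count(text: str) -> int:
    """Count rows with a parser that honors quoted fields."""
    # newline="" leaves line endings untouched so the csv module sees them as-is.
    return sum(1 for _ in csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print("naive:", naive_count(SAMPLE))
    print("quote-aware:", quote_aware_count(SAMPLE))
```

The two counts disagree on this input because the embedded newline splits one logical record across two physical lines. A fast parser that skips such cases can look good in a benchmark while returning a wrong answer on real data.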

Answer 2 · 2022-02-25T14:57:56.000Z

Regarding sub-points 1, 2 and 3 of your question: we are not sure what you are requesting, but we are happy to consider contributions (even partial ones) if you have specific suggestions for changes to the repository that would help address them.

Answer 3 · 2022-02-25T18:34:27.000Z

The description should probably include "CPU-based", since GPU-based parsing sounds like it could be faster: https://github.com/antonmks/nvParse

Answer 4 · 2022-02-25T19:29:26.000Z

We're not going to rename the library, because the usual assumption is that a C library uses the CPU and not the GPU; for example, libz is not called "libz-cpu".

Furthermore, just because something sounds like it can be faster doesn't mean it is.

First, a GPU cannot be assumed to always be available, and in many circumstances (such as when running as WebAssembly, or in a typical serverless computing environment), it most definitely is not.

Second, even when a GPU is available, there is still significant overhead in loading the data into GPU memory and getting the results back out.

Third, it is much harder to handle the edge cases with GPU instructions, and we are not inclined to spend our time solving that problem, especially given the above limitations (in particular, the first).

If anyone can point to a GPU library that works for CSV parsing including edge cases, and can be compiled to run "count" and "select", then we'll happily consider adding it to the benchmark. Until then, we don't have the resources here to add benchmarks against incomplete alternative parsers.