jdunck/python-unicodecsv

unicodecsv is kind of slow; but maybe unavoidable?

NelsonMinar opened this issue · 5 comments

Thank you so much for unicodecsv, it's been a big help for me in Python 2. Not to sound ungrateful, but...

unicodecsv seems fairly slow. Some benchmarking suggests it's about 5-6x slower than the plain Py2 csv module. Of course it's doing more work, decoding bytes to strings! But for comparison the Py3 csv module (which does decoding) is only 2-3x slower than Py2. Is there room for improvement in unicodecsv?

I did some profiling and code reading and didn't see any obvious way unicodecsv could be made faster. So maybe there's no real way to optimize it. But I wanted to file the issue both to document what I learned and to get a second opinion.

My benchmark code and results are at https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/

I don't see the file you were using, but for the 1M-line CSV file I was playing with today, I found that the isinstance() calls in UnicodeReader#next were taking around 50% of the runtime. And unless the dialect requests QUOTE_NONNUMERIC, the underlying reader only ever yields strings, so the non-string branch of that check is never going to hit.

I've submitted a pull request, /pull/47, which avoids the isinstance() call in this case. It's still about 3x slower than the built-in (ASCII) 'csv' module, but it's significantly faster than before.
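
For anyone reading along, here is a rough sketch of the idea, not the actual code from the pull request. It assumes rows arrive as lists of byte strings, as they do from the Python 2 csv module, with floats mixed in only when the dialect uses QUOTE_NONNUMERIC. The function names (decode_row_checked, decode_row_unchecked, pick_row_decoder) are made up for illustration, and the sketch is written to run on Python 3, where bytes stands in for the Python 2 str type.

```python
import csv


def decode_row_checked(row, encoding="utf-8"):
    """Decode byte-string fields; pass through anything else (e.g. floats)."""
    return [v.decode(encoding) if isinstance(v, bytes) else v for v in row]


def decode_row_unchecked(row, encoding="utf-8"):
    """Decode every field with no type check; only safe if all fields are bytes."""
    return [v.decode(encoding) for v in row]


def pick_row_decoder(quoting):
    """Pick the decoder once per reader instead of type-checking every field.

    Hypothetical helper: only QUOTE_NONNUMERIC can produce non-string fields,
    so every other quoting mode can take the unchecked path.
    """
    if quoting == csv.QUOTE_NONNUMERIC:
        return decode_row_checked
    return decode_row_unchecked
```

A reader would call pick_row_decoder(dialect.quoting) once at construction time and then apply the returned function to each row, so the common case never pays for isinstance() per field.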

I noticed a fair amount of time spent in isinstance too but assumed it was unavoidable. Sounds like your code is a good improvement if it works!

I spent some time looking at the speed of Python Unicode decoding and am more confused than ever as to exactly what's going on with the larger speed issue. https://nelsonslog.wordpress.com/2015/02/26/python-file-reading-benchmarks/

@NelsonMinar thanks for the detailed benchmarking. I'll leave this open as a reminder to do other optimization work, but I've merged #47.

I've just released 0.11.0, which includes changes in #47.

Nice, thanks for the update! I just tested it and it makes my benchmark run in 70-80% of the time it used to. Very nice improvement for a simple change. Detailed timings: https://nelsonslog.wordpress.com/2015/03/11/unicodecsv-0-11-0-speed-improvement/