singlestore-labs/singlestoredb-python

Polars

Closed this issue · 6 comments

This library has been getting better and better, big thanks to esmit13 for the amazing work.

I was wondering if results_type could support a polars DataFrame or a pyarrow array as one of its options. I think this would give very good performance for many users.
Right now I have to convert the results from a DataFrame or dictionaries to polars myself, and I think it could be a pretty quick change.

Thanks for the comments on the package! We're definitely trying to make it a cut above the standard MySQL clients.

We did have some experimental work in there at one point that would build numpy arrays in the C extension, which could then easily be used by polars, pandas, and pyarrow. We ran into some problems making it work correctly in unbuffered mode and haven't released it yet. It would be fairly easy to take the current tuple output and create DataFrames for polars and pandas, but it wouldn't be optimal because the data would have to go through Python objects first and then into the packed arrays. I might be able to look into doing it the sub-optimal way for now and swap it out later. I guess having something that works sub-optimally might still be better than nothing at all.
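For anyone following along, the "sub-optimal way" described above can be sketched roughly as below. This is only an illustration of the tuple-to-columns step, not the code in the package: `rows_to_columns` and the sample rows are hypothetical names and data, and the polars call is left as a comment so the snippet runs with the standard library alone.

```python
from datetime import date


def rows_to_columns(rows, column_names):
    """Transpose a list of row tuples into a dict of column lists.

    Hypothetical helper for illustration: every value is still a Python
    object here, which is why this route is slower than building packed
    arrays directly in a C extension.
    """
    if not rows:
        return {name: [] for name in column_names}
    columns = list(zip(*rows))  # one tuple per column
    if len(columns) != len(column_names):
        raise ValueError("number of column names does not match row width")
    return {name: list(col) for name, col in zip(column_names, columns)}


# Sample data standing in for the tuples a cursor would return.
rows = [
    (1, "alpha", date(2024, 1, 1)),
    (2, "beta", date(2024, 1, 2)),
]
cols = rows_to_columns(rows, ["id", "name", "day"])
print(cols["id"], cols["name"])

# With polars installed, the dict of lists can be passed on directly:
#   import polars as pl
#   df = pl.DataFrame(cols)
```

The extra pass over every value is the cost being discussed: the rows are materialized as Python tuples first and only then repacked into a column layout.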

Yes, something is always better than nothing.
Again thanks a lot for all your improvements!

I pushed a branch (https://github.com/singlestore-labs/singlestoredb-python/tree/result-types) for trying this feature out. It does create each of the output types (numpy, pandas, polars, and arrow), but it's pretty slow. This is mostly because it's having to create the tuples first, then convert the tuple form to the new output type. This means it will always be slower than creating tuples. Of course, that will be true of anyone taking the current output and creating the desired format themselves. I'm going to do some profiling to see if there are any places I can improve.

I did some performance tuning with polars since that's the one that seemed to have the biggest issue (primarily with date/times). Using any of the new result types (arrow, numpy, pandas, polars) still takes about twice as long as using tuples. That may not be noticeable for basic queries. I did my performance test using a 3.5 million row result.
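A rough way to see why the converted types cost more than tuples is to time the tuple step alone against the tuple step plus the conversion. The sketch below uses synthetic rows as a stand-in for a real fetch, so the numbers it prints say nothing about the actual driver; `make_rows`, `to_columns`, and `best_time` are hypothetical helpers, and the row count is kept small so it runs quickly.

```python
import time
from datetime import datetime, timedelta


def make_rows(n):
    """Build n synthetic row tuples (stand-in for fetching tuples)."""
    start = datetime(2024, 1, 1)
    return [(i, f"name-{i}", start + timedelta(seconds=i)) for i in range(n)]


def to_columns(rows):
    """Repack row tuples into one list per column."""
    return [list(col) for col in zip(*rows)]


def best_time(func, *args, repeat=3):
    """Return the fastest wall-clock time of several runs, in seconds."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - t0)
    return best


def fetch_and_convert(n):
    """Tuples first, then conversion: the two-step path described above."""
    return to_columns(make_rows(n))


n = 10_000  # far smaller than the 3.5 million rows used in the real test
tuples_only = best_time(make_rows, n)
with_conversion = best_time(fetch_and_convert, n)
print(f"tuples only:     {tuples_only:.4f} s")
print(f"with conversion: {with_conversion:.4f} s")
```

Because the conversion always runs after the tuples exist, the second timing includes all of the first, which matches the observation that the new result types cannot be faster than tuples with this approach.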

I will note that this is more of a stop-gap solution until we get Apache Arrow support in the server (hopefully by the end of the year). At that point, we will use Arrow's conversion routines to get these result types and it will be much faster (faster than tuples).

I committed the new result types in #25. I'll do another release in the next few days.

@farooo2 v1.1.0 has the new result types in it. Let me know if you have any problems with it.
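For reference, trying the new result types might look something like the sketch below. It assumes that `singlestoredb.connect` accepts a `results_type` keyword with the names used in this thread (`tuples`, `numpy`, `pandas`, `polars`, `arrow`); please check the package documentation for the exact spelling. `fetch_as` is a hypothetical wrapper, and the connection URL and query in the usage comment are placeholders.

```python
# Result type names mentioned in this thread (assumed to match the
# values the package accepts).
RESULT_TYPES = {"tuples", "numpy", "pandas", "polars", "arrow"}


def fetch_as(url, query, results_type="polars"):
    """Run a query and return all rows in the requested result type.

    Hypothetical helper: assumes singlestoredb.connect() takes a
    results_type keyword, as discussed in this issue.
    """
    if results_type not in RESULT_TYPES:
        raise ValueError(f"unknown results_type: {results_type!r}")

    # Imported lazily so the validation above works without the package.
    import singlestoredb as s2

    conn = s2.connect(url, results_type=results_type)
    try:
        cur = conn.cursor()
        try:
            cur.execute(query)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()


# Example call (placeholder credentials and table name):
#   df = fetch_as("user:password@host:3306/db_name",
#                 "SELECT * FROM my_table", results_type="polars")
```

As noted earlier in the thread, these types are currently built from tuples, so expect them to be slower than plain tuples on large results.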