cloudera/impyla

Status of Impyla + Python 3.10

cpcloud opened this issue · 12 comments

Hi impyla devs!

We make heavy use of impyla in ibis, and as I'm sure y'all know, Cloudera was where ibis originated as a library to give users a first-class analytics experience in Python.

I and many others appreciate the effort and support that has gone into building impyla.

Since ibis and impyla were created, ibis has gained a significant number of users using the impala backend and for a while impyla and ibis worked well together.

In 2022, the impala backend has become a maintenance burden that many of the ibis developers are reluctant to work on. There are a few reasons for this, but one of them is that this library doesn't appear to be actively maintained. The impala backend has a hard dependency on impyla, which makes impyla core to the basic functionality of the backend.

An example of a problem we're currently facing is that impyla does not seem to support Python 3.10. It looks like one reason for that is because the thrift API is generated using a version of the thrift compiler that doesn't generate 3.10 compatible code. I imagine regenerating the thrift API will cause a ton of breakage, and therefore require a long time to make it into a release. I am happy to provide more details if desired.

Without knowing anything else it's getting harder to justify keeping the impala backend contained inside of the core ibis repository. Moving it out is a last resort, but we can't continue to support the impala backend forever without some help from the folks working on this project.

I would love to have a discussion about what the status of the library is and whether things like Python 3.10 support are on the table.

Thanks!

Hi!

> I imagine regenerating the thrift API will cause a ton of breakage, and therefore require a long time to make it into a release. I am happy to provide more details if desired.

Yes, regenerating the Thrift API can be a bit tricky - we usually do it with the Thrift compiler used by Impala, which is 0.11.0 Thrift + a few patches at the moment: https://github.com/cloudera/native-toolchain/tree/master/source/thrift/thrift-0.11.0-patches
The non-trivial part was supporting non-ASCII UTF-8 strings in both Python 2 and 3 environments.

Do you know the Thrift change that fixed this? I found THRIFT-5488 that seems related.

> An example of a problem we're currently facing is that impyla does not seem to support Python 3.10.

Which version of Impyla do you use for testing? Is it possible that this was broken during 15f202e?

I tested Impyla with both Python 2 and 3 (though I am not sure about the exact Python 3 version, probably 3.9) when I was doing Thrift-related changes in Impyla.

> Hi!
>
> > I imagine regenerating the thrift API will cause a ton of breakage, and therefore require a long time to make it into a release. I am happy to provide more details if desired.
>
> Yes, regenerating the Thrift API can be a bit tricky - we usually do it with the Thrift compiler used by Impala, which is 0.11.0 Thrift + a few patches at the moment: https://github.com/cloudera/native-toolchain/tree/master/source/thrift/thrift-0.11.0-patches The non-trivial part was supporting non-ASCII UTF-8 strings in both Python 2 and 3 environments.
>
> Do you know the Thrift change that fixed this? I found THRIFT-5488 that seems related.

Yes, THRIFT-5488 is the exact problem we're hitting with 3.10.

> > An example of a problem we're currently facing is that impyla does not seem to support Python 3.10.
>
> Which version of Impyla do you use for testing? Is it possible that this was broken during 15f202e?
>
> I tested Impyla with both Python 2 and 3 (though I am not sure about the exact Python 3 version, probably 3.9) when I was doing Thrift-related changes in Impyla.

We're using impyla 0.17.0. Here's poetry show impyla for ibis' master branch:

❯ poetry show impyla
name         : impyla
version      : 0.17.0
description  : Python client for the Impala distributed query engine

dependencies
 - bitarray *
 - kerberos >=1.3.0
 - six *
 - thrift 0.11.0
 - thrift-sasl 0.4.3
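
For anyone reproducing this outside of poetry, roughly the same details can be collected from inside the interpreter that actually runs the tests. This is only a sketch: `installed_versions` is a hypothetical helper name, not part of impyla or ibis, and the package list is just the set of distributions mentioned in this thread.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The Python version matters here because the problem is specific to 3.10.
    print("python", platform.python_version())
    report = installed_versions(["impyla", "thrift", "thrift-sasl"])
    for name, version in report.items():
        print(name, version or "not installed")
```

Pasting this output into a report makes it clear which impyla and thrift versions were paired with which Python version.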

Hi! Sorry for the long delay.

An issue with bumping Thrift version is that Thrift 0.16.0 is not yet uploaded to pip. I have written to the Apache Thrift user list about this.

Can you provide more info on how to reproduce the error? Running the Impyla test suite with Python 3.10 was successful on my machine (Ubuntu 18.04), though I have noticed that the native Thrift components were not loaded, so THRIFT-5488 was not hit because the fallback Python code was used (making Impyla much slower).

> Hi! Sorry for the long delay.
>
> An issue with bumping Thrift version is that Thrift 0.16.0 is not yet uploaded to pip. I have written to the Apache Thrift user list about this.
>
> Can you provide more info on how to reproduce the error? Running the Impyla test suite with Python 3.10 was successful on my machine (Ubuntu 18.04), though I have noticed that the native Thrift components were not loaded, so THRIFT-5488 was not hit because the fallback Python code was used (making Impyla much slower).

Does fallback Python code mean using thriftpy2? If so, it would seem impyla has an undeclared dependency on that package.

> Does fallback Python code mean using thriftpy2?

The fallback happens inside the Thrift library:
https://github.com/apache/thrift/blob/master/lib/py/src/protocol/TBinaryProtocol.py#L278

If it cannot load the fastbinary module, then it falls back to the generated Python code to (de)serialize Thrift structs.
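
A quick way to see which path an environment is likely to take is to try importing the extension directly. This is a minimal sketch, assuming the C extension is exposed as `thrift.protocol.fastbinary`; `can_import` and `thrift_acceleration_available` are hypothetical helper names, not part of Thrift or Impyla.

```python
import importlib


def can_import(module_name):
    """Return True if the named module can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Also covers ModuleNotFoundError, e.g. when thrift is not installed.
        return False
    return True


def thrift_acceleration_available():
    # Assumption: the native (de)serializer lives at thrift.protocol.fastbinary.
    return can_import("thrift.protocol.fastbinary")


if __name__ == "__main__":
    if thrift_acceleration_available():
        print("fastbinary importable: the accelerated path can be used")
    else:
        print("fastbinary not importable: pure-Python fallback will be used")
```

Note that a successful import only shows the extension is present; it does not by itself prove that the extension works correctly on a given Python version.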

We used thriftpy2 with Python 3 before Impyla 0.17.0, but it has now been completely removed:
15f202e
Before that change, Python 3 always used thriftpy2, while Python 2 always used Apache Thrift.

The main benefit of switching to Apache Thrift with Python 3 was increased speed (~4x faster deserialization if fastbinary is loaded).

This means that using Impyla 0.16.0 (or anything <=0.17a5) is a possible workaround with Python 3.10, as it still used thriftpy2.

> An issue with bumping Thrift version is that Thrift 0.16.0 is not yet uploaded to pip.

Meanwhile, Thrift 0.16.0 was uploaded to pip and using it solved the issue on my machine. I am running more tests at the moment to see whether this causes any regressions. If all tests are green, I will create a release with the bumped Thrift version.

Hi, @csringhofer
I have a question about this issue.
Could you share the release plan for an impyla version that includes #490?

@seokbaeklee
0.18a4 was uploaded to PyPI yesterday and it contains #490.
So far it looks good, but I am still running some tests to see if anything comes up. Will write an update when the testing is finished.

I tested 0.18a4 with ibis and Python 3.10 and can confirm that all of our impala unit and integration tests pass.

Thanks @csringhofer!

I tested version 0.18a4 and it works in our working environment, thanks!!!
However, I think it might be useful to fix the README.md since it contains a reference to thrift==0.11.0 that is no longer true.

Thanks @agdiiura for spotting the README!
Updated it in 119cae7

Released Impyla 0.18.0, so now there is an official release that supports Python 3.10:
https://pypi.org/project/impyla/0.18.0/

Closing this issue.

Thanks for getting this done!