Update ByteTokenizer to remove TensorFlow dependency

Question

Opened this issue 4 months ago · 1 comment

Is your feature request related to a problem? Please describe.
When testing Keras 3 in the MLX branch, I got a TensorFlow import error. I explicitly set the environment variable KERAS_BACKEND="mlx" with os.environ in the file and got the same error.
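For anyone trying to reproduce this, the setup is roughly the following. This is a minimal sketch, not the original script: the backend name "jax" is just one of the values tried here, and `tensorflow_installed` is a hypothetical helper added for illustration. Keras reads KERAS_BACKEND when it is first imported, so the variable has to be set before any `import keras` or `import keras_nlp` line runs.

```python
import importlib.util
import os

# Set the backend BEFORE keras / keras_nlp are imported anywhere in the process.
# "jax" is an example value; the report above also tried "mlx".
os.environ["KERAS_BACKEND"] = "jax"


def tensorflow_installed() -> bool:
    """Hypothetical helper: report whether a `tensorflow` package is importable."""
    return importlib.util.find_spec("tensorflow") is not None


if not tensorflow_installed():
    # With no TensorFlow present, a later `import keras_nlp` is where the
    # import error described above would be expected to surface.
    print("TensorFlow is not installed; preprocessing layers may fail to import.")
```

Setting the variable this way only selects the backend for training and inference; it does not change what a library imports for its own preprocessing code.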

Looking through the source of ByteTokenizer, I noticed the TensorFlow import.

Describe the solution you'd like
Multi-backend support, as in Keras Core, since this library is advertised as working with Keras 3.

Describe alternatives you've considered
I tried uninstalling Keras and keras-nlp and reinstalling them from pip (both stable and nightly). I also installed JAX and switched my KERAS_BACKEND to it, only to receive the same error.

Additional context
That's pretty much it.

Answer 1 · 2024-02-28T21:48:04.000Z

Thanks for filing!

In short, this is expected. We do have a dependency on TensorFlow for all preprocessing, specifically for the tf.data library. KerasNLP will still run all training and inference with the backend you select, but all preprocessing will run with tf.data regardless of backend.

I agree with you that this is a pain point. We do it because tf.data is quite a scalable solution for preprocessing. You can run our preprocessing without Python if you compile it properly. You can go from single-machine, multi-core preprocessing up to a cluster of machines all coordinating, all with the same tech stack. It's fast.

However, we would really like to find a good, performant preprocessing solution that does not require a TensorFlow installation. That could come in a lot of forms, from just separating pip packages to rebuilding our preprocessing from the ground up. We are actively looking at this, though I don't think this will be a quick or easy change. We will broadcast plans more here once we have them!
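In the meantime, for the byte-level case specifically, the core idea is small enough to sketch in plain Python with no TensorFlow import. This is only an illustration under stated assumptions, not KerasNLP's implementation: `byte_tokenize` and `byte_detokenize` are hypothetical names, the ids are assumed to be the raw UTF-8 byte values (0 to 255), and padding with 0 is an assumption that has not been checked against ByteTokenizer's actual output.

```python
from typing import List, Optional


def byte_tokenize(
    text: str, sequence_length: Optional[int] = None, pad_id: int = 0
) -> List[int]:
    """Hypothetical stand-in: map a string to its UTF-8 byte values."""
    ids = list(text.encode("utf-8"))
    if sequence_length is not None:
        # Truncate, then right-pad to a fixed length.
        ids = ids[:sequence_length]
        ids += [pad_id] * (sequence_length - len(ids))
    return ids


def byte_detokenize(ids: List[int], pad_id: int = 0) -> str:
    """Drop padding ids and decode the remaining bytes as UTF-8."""
    # Caveat: with pad_id=0, genuine NUL bytes in the text are dropped too.
    # Truncation can split a multi-byte character, hence errors="replace".
    return bytes(i for i in ids if i != pad_id).decode("utf-8", errors="replace")


print(byte_tokenize("hi", sequence_length=4))  # [104, 105, 0, 0]
print(byte_detokenize([104, 105, 0, 0]))       # hi
```

A loop like this gives up the parallelism and graph compilation that make tf.data attractive, so it is better read as a stopgap for experiments on a TensorFlow-free install than as a replacement for the library's preprocessing.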