delta-io/connectors

Delta Standalone: Jar Hell - Hadoop dependencies

VladimirPchelko opened this issue · 8 comments

The package size becomes too large due to the Hadoop dependencies.

@VladimirPchelko The package itself is not large. Do you mean that when you build an uber jar including all dependencies, it's too large? Unfortunately, there is no great solution for this: we need Hadoop because cloud storage access goes through the Hadoop FileSystem APIs.

Due to the Hadoop dependencies, it is very problematic to use your project in AWS Lambda (which needs an uber jar): without complicated manipulations, the package size exceeds the limits.

Unfortunately, there is little we can do. We need Hadoop to access cloud storage through the Hadoop FileSystem APIs. I found this StackOverflow answer that may let you deploy a larger jar: https://stackoverflow.com/a/72646550/1038826 Could you check whether it helps?
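Another angle, if you are on Maven, is to shrink the uber jar itself rather than raise the limit: depend on the slimmer shaded `hadoop-client-runtime` instead of the full `hadoop-client`, pull in only the S3A connector you actually need, and let the Shade plugin drop unused classes. This is only a sketch under those assumptions; the artifact versions are illustrative, and `minimizeJar` can remove classes that Hadoop loads reflectively, so test the resulting jar carefully:

```xml
<!-- Sketch of a slimmer build for Lambda (versions illustrative). -->
<dependencies>
  <dependency>
    <groupId>io.delta</groupId>
    <artifactId>delta-standalone_2.12</artifactId>
    <version>0.6.0</version>
  </dependency>
  <!-- Shaded, trimmed-down Hadoop runtime instead of full hadoop-client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-runtime</artifactId>
    <version>3.3.4</version>
  </dependency>
  <!-- Only the S3A connector, since Lambda talks to S3 -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.3.4</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <!-- maven-shade-plugin builds the uber jar; minimizeJar strips unused classes -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.4.1</version>
      <configuration>
        <minimizeJar>true</minimizeJar>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Whether the result fits under the Lambda limit depends on what else is in your jar; if it is still too big, a container-image deployment (which allows much larger images) is the usual fallback.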

@VladimirPchelko - did this new approach work for you?

Would it not be better to avoid relying on the Hadoop libraries for cloud storage access, since all major cloud storage providers offer native APIs in multiple languages? delta-rs, the Rust implementation of Delta Lake, is a good example of a project that has done this cleanly.

In our case, we moved to Delta Lake (and cloud storage) precisely to get away from Hadoop entirely, yet now we find ourselves depending on Hadoop libraries again just to use Delta Lake, which to my mind doesn't make much sense.

This repo has been deprecated, and the code has been moved under the connectors module in the https://github.com/delta-io/delta repository. Please create new issues in https://github.com/delta-io/delta. See #556 for details.