MrPowers/mack

Brainstorm correct ways to include PySpark & Delta dependencies in pyproject.toml file


Users have to supply a correct combination of Spark and Delta Lake versions for their setup to work; see the compatibility matrix.

Mack depends on PySpark & Delta Lake. We want Mack to work with a variety of Spark & Delta Lake combinations.

Here's how the dependencies are currently specified in the pyproject.toml file:

[tool.poetry.dependencies]
python = "^3.9"

[tool.poetry.dev-dependencies]
pre-commit = "^2.20.0"
pyspark = "3.3.1"
delta-spark = "2.1.1"
pytest = "7.2.0"
chispa = "0.9.2"
pytest-describe = "^1.0.0"

I'm not sure of the best way to specify dependencies with Poetry to give our users the best Mack download experience. Thoughts?

I am not sure what would be the correct way, but maybe we could apply something like:

[tool.poetry.extras]
delta2.2-spark3.3 = ["delta-spark^2.2.0", "pyspark^3.3.1"]
delta2.1-spark3.3 = ["delta-spark^2.1.1", "pyspark^3.3.1"]

and then the user would just need to run poetry update; poetry install -E delta2.1-spark3.3 to install desired dependencies.
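One caveat I'd flag: as far as I can tell from the Poetry docs, the entries under [tool.poetry.extras] are the names of dependencies marked optional = true, not arbitrary version specifiers, so pinning a different version per extra may not be directly expressible. A sketch of the closest Poetry-native form (version ranges here are illustrative, not the real matrix) would be:

```toml
[tool.poetry.dependencies]
python = "^3.9"
# optional = true keeps these out of a plain `poetry install`
pyspark = { version = ">=3.2,<3.4", optional = true }
delta-spark = { version = ">=2.0,<2.4", optional = true }

[tool.poetry.extras]
# extras list dependency *names*; the ranges above do the constraining
spark = ["pyspark", "delta-spark"]
```

With this shape, poetry install -E spark pulls in both packages, and it's the version ranges rather than the extra name that decide which combinations resolve.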

One of the pitfalls would be making sure that Python 3.9 works for every dependency combination, and we would also need tests for each combination to verify it works with the application code. What do you think?

I haven't given this idea a try yet, but looking at this issue gave me the impression that this could work.

@joao-fm-santos - this blog post has more context on the issue from a usability perspective.

Mack will typically be included as a dependency in other files. I'm not sure how we set up a Python project to correctly install a specific Delta Lake version based on the PySpark version that the user specified.

@danielbeach - FYI, we're looking into this issue.

@alexott - feel free to provide suggestions.

@MrPowers thanks for the blog post, really helpful!
Unless I have misunderstood the problem, I believe adding extras would be a good way to solve this issue: it lets users rely on common Poetry syntax to install dependencies and choose the version combination they prefer.

For example, a user could:

  • add the dependency directly in their pyproject.toml file like so:
[tool.poetry.dependencies]
mack = {version = "*", extras = ["delta2.2-spark3.3"]}
  • or add the dependency via the command line:
poetry add 'mack[delta2.2-spark3.3]'

For the pip installation, I believe we would need to change setup.cfg to include extras like so.
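If it helps, here is a hedged sketch of how that could look in a setuptools setup.cfg (extra names and version ranges are illustrative and untested):

```ini
[options.extras_require]
delta2.2-spark3.3 =
    delta-spark>=2.2,<2.3
    pyspark>=3.3,<3.4
delta2.1-spark3.3 =
    delta-spark>=2.1,<2.2
    pyspark>=3.3,<3.4
```

Unlike Poetry extras, setuptools extras_require does accept a version specifier per extra, which is closer to the one-extra-per-combination naming suggested above.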

I have not tried this, but let me know if I am missing the point here!

@joao-fm-santos - yea, extras could be the right way to solve this. I don't know.

We need a solution that will work in a variety of execution contexts:

  • when a user runs pip install mack, they should get the required dependencies installed
  • when a user includes mack in an environment file that includes PySpark (and doesn't include Delta), then mack should include the right Delta version as a transitive dependency
  • when a user includes mack in an environment file that doesn't have either PySpark or Delta, they should both get added as transitive dependencies
  • users should be able to build mack wheel files and attach them to PySpark clusters
  • users should be able to run pip install mack on an existing PySpark cluster and get all the dependencies installed.

One of my other projects uses a library called findspark. Is it possible we need a library like finddelta?
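To make the finddelta idea concrete, here is a rough sketch of what such a helper could look like. All names are hypothetical (no such library exists as far as I know), and the compatibility table is illustrative; the Delta Lake docs' compatibility matrix would be the source of truth.

```python
# Hypothetical "finddelta"-style helper: map the installed PySpark version
# to a compatible delta-spark requirement specifier.
import importlib.metadata

# (PySpark major, minor) -> delta-spark requirement (illustrative values only)
_COMPAT = {
    (3, 1): "delta-spark>=1.0,<1.1",
    (3, 2): "delta-spark>=1.2,<2.1",
    (3, 3): "delta-spark>=2.1,<2.4",
}


def compatible_delta_spec(pyspark_version: str) -> str:
    """Return a delta-spark requirement compatible with this PySpark version."""
    major, minor = (int(part) for part in pyspark_version.split(".")[:2])
    try:
        return _COMPAT[(major, minor)]
    except KeyError:
        raise RuntimeError(
            f"No known compatible delta-spark version for PySpark {pyspark_version}"
        )


def find_delta() -> str:
    """Look up the installed PySpark distribution and return a matching spec."""
    return compatible_delta_spec(importlib.metadata.version("pyspark"))
```

A tool like this could run at import time (as findspark does for Spark) and fail fast with a clear message when the installed PySpark/Delta pair is outside the matrix, rather than surfacing an obscure classpath error later.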