pyspark-ai/pyspark-ai

Create a join method

ericfeunekes opened this issue · 1 comment

I'm thinking about how to use this in my business to bring people into Databricks instead of using low-code platforms that are more difficult to support.

One limitation I see today is that you can only work with a single DataFrame. There are a few enhancements that I think would help:

  1. A table method. Similar to ingesting data from the web, but it would search the available schemas/tables to find the one that best fits the user's query.
  2. A join method. If you have two pyspark-ai DataFrames, it would be great to just do df.ai.join(df2) and have the tool figure out the best way to join them. You could add an optional English explanation, e.g. df.ai.join(df2, "everything from df") for a left join.
  3. A code method. This should just provide the code that a user can copy/paste to reproduce whatever the ai method did.
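To make the join idea concrete, here is a minimal sketch of how an English hint could be mapped to a Spark join type. Everything here is hypothetical: `infer_join_type` is not part of pyspark-ai, and a real implementation would send the hint (plus both schemas) to the LLM rather than use keyword rules.

```python
from typing import Optional


def infer_join_type(hint: Optional[str]) -> str:
    """Guess a Spark join type from a natural-language hint.

    Rule-based stand-in for an LLM call, only to illustrate the
    proposed df.ai.join(df2, hint) API shape.
    """
    if not hint:
        return "inner"
    h = hint.lower()
    if "everything from" in h or "all rows from" in h:
        return "left"  # keep every row of the left DataFrame
    if "all rows" in h or "either" in h:
        return "full"  # keep unmatched rows from both sides
    return "inner"  # default: only matching rows


# Hypothetical usage inside a join method:
#   join_type = infer_join_type("everything from df")
#   df.join(df2, on=inferred_keys, how=join_type)
```

The interesting part an LLM would add on top of this is inferring the join keys themselves from the two schemas, which no keyword rule can do.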

Obviously these are trivial to do in Spark for the average Spark user, but I'm thinking about how this could allow nontechnical users to fully interact with a whole catalog of data in English. The code method would then let them make the notebook deterministic if they wanted to turn it into a job or something similar.

Hi @ericfeunekes,

thanks for the valuable input! I am just back from vacation, sorry for the late reply.
Supporting joins across multiple DataFrames, and code-only generation, have always been on our roadmap. We haven't gotten to them yet because we have been focusing on production readiness for single-DataFrame operations.
As for a table method, it seems more straightforward to have a text-to-SQL method over multiple tables, such as https://python.langchain.com/docs/integrations/toolkits/spark_sql (which I also implemented).
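For illustration, the core of a text-to-SQL method over multiple tables is assembling the catalog's schemas plus the user's question into a prompt for the LLM. The sketch below shows only that step; the helper name and prompt wording are hypothetical, and the linked LangChain Spark SQL toolkit handles this end to end against a live SparkSession.

```python
from typing import Dict, List


def build_text_to_sql_prompt(question: str, schemas: Dict[str, List[str]]) -> str:
    """Render the available table schemas and the user's question into
    a single prompt string for an LLM to answer with Spark SQL."""
    lines = ["You are given these Spark SQL tables:"]
    for table, columns in schemas.items():
        lines.append(f"  {table}({', '.join(columns)})")
    lines.append(f"Write a Spark SQL query answering: {question}")
    return "\n".join(lines)


# Example: two tables a nontechnical user could query in English.
prompt = build_text_to_sql_prompt(
    "total sales per region",
    {"sales": ["id", "region", "amount"], "regions": ["region", "manager"]},
)
```

In a real toolkit the schemas would come from the Spark catalog (e.g. `spark.catalog.listTables()` and each table's columns) rather than a hand-written dict, and the generated SQL would be executed and the result returned to the user.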