[Discussion] Deprecate the native Spark implementation in favour of Fugue or Pandas on Spark
fdosani opened this issue · 1 comments
fdosani commented
@jdawang @ak-gupta @NikhilJArora Want your opinions on the above.
So right now the native Spark implementation is a bit different from the Pandas, Polars, and Fugue versions:
- It doesn't handle duplicates.
- From my understanding it doesn't do an outer join like the other implementations; instead it does an inner join and then a subtract to find the left-only and right-only rows.
- It has a different init signature, with `known_differences` and `column_mapping`, which leads to different functionality.
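To make the second point concrete, here is a minimal pandas sketch of the two strategies (toy frames and column names are my own, not datacompy's actual data model): an outer join with an indicator gives matched, left-only, and right-only rows in one pass, while the inner-join-plus-subtract approach reconstructs the unmatched rows separately.

```python
import pandas as pd

# Hypothetical toy frames for illustration only.
left = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "val": ["b", "x", "d"]})

# Outer-join style (Pandas/Polars/Fugue versions): one merge with an
# indicator column classifies every row as both / left_only / right_only.
outer = left.merge(right, on="id", how="outer", indicator=True, suffixes=("_l", "_r"))
left_only = outer[outer["_merge"] == "left_only"]
right_only = outer[outer["_merge"] == "right_only"]

# Inner-join + subtract style (the native Spark version, as I understand it):
# the intersection comes from an inner join, and each side's unmatched rows
# are found by subtracting the matched keys.
inner = left.merge(right, on="id", how="inner", suffixes=("_l", "_r"))
left_unmatched = left[~left["id"].isin(inner["id"])]
right_unmatched = right[~right["id"].isin(inner["id"])]
```

Both recover the same row sets here, but the outer-join form keeps everything in a single joined frame, which is what makes the duplicate handling and reporting in the other implementations easier.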
I'm thinking we can "deprecate" it while leaving it around for backwards compatibility for the next little while. If folks want to continue to use it they can explicitly import it. Maybe we can rename it `LegacySparkCompare` or something. My main goal is to consolidate and clean up the package.
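The usual pattern for this kind of soft deprecation is a thin shim: keep the old class under the new name and have the old entry point emit a `DeprecationWarning` before delegating. A rough sketch (all names here are assumptions, not datacompy's actual API):

```python
import warnings

class LegacySparkCompare:
    """The current native Spark implementation, kept around under a new name."""

    def __init__(self, *args, **kwargs):
        # The existing native-Spark comparison logic would live here unchanged.
        self.args = args
        self.kwargs = kwargs

def SparkCompare(*args, **kwargs):
    # Old entry point: warn once per call site, then delegate so existing
    # user code keeps working until the legacy version is removed.
    warnings.warn(
        "SparkCompare is deprecated; import LegacySparkCompare explicitly "
        "or switch to the Pandas-on-Spark or Fugue implementations.",
        DeprecationWarning,
        stacklevel=2,
    )
    return LegacySparkCompare(*args, **kwargs)
```

That way nothing breaks immediately, and the warning points people at the replacement.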
We have a Pandas on Spark implementation which mimics the Pandas logic much more closely, but it is obviously also doing a lot more than this version, so its performance lags a bit.