capitalone/datacompy

[Discussion] Deprecate the native Spark implementation in favour of Fugue or Pandas on Spark

fdosani opened this issue · 1 comments

@jdawang @ak-gupta @NikhilJArora Want your opinions on the above.

So right now the native Spark implementation is a bit different from the Pandas, Polars, and Fugue versions.

  • It doesn't handle duplicates.
  • From my understanding, it doesn't do an outer join like the other implementations; it only inner joins and then does a subtract for the left and right.
  • It has a different init signature, with known_differences and column_mapping, which leads to different functionality.
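
To make the first two points concrete, here is a rough sketch in plain pandas (not datacompy's actual code) of the two join strategies and of why duplicate keys need special handling. The frames, column names, and the `_order` helper column are made up for illustration, and numbering rows within each key is just one common way to pair duplicates, assumed here rather than taken from any of the implementations.

```python
import pandas as pd

# Toy inputs (hypothetical): ids 2 and 3 are shared, 1 is left-only, 4 is right-only.
left = pd.DataFrame({"id": [1, 2, 3], "val": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "val": [20, 31, 40]})

# Strategy A: a single outer join, then split on the indicator column.
outer = left.merge(
    right, on="id", how="outer", suffixes=("_l", "_r"), indicator=True
)
both_a = outer[outer["_merge"] == "both"]
left_only_a = outer[outer["_merge"] == "left_only"]
right_only_a = outer[outer["_merge"] == "right_only"]

# Strategy B: an inner join for the shared rows, plus a key "subtract" per side.
both_b = left.merge(right, on="id", how="inner", suffixes=("_l", "_r"))
left_only_b = left[~left["id"].isin(right["id"])]
right_only_b = right[~right["id"].isin(left["id"])]

# Duplicate keys: a plain join pairs every left row with every right row.
dup_left = pd.DataFrame({"id": [1, 1], "val": [10, 11]})
dup_right = pd.DataFrame({"id": [1, 1], "val": [10, 11]})
fanned_out = dup_left.merge(dup_right, on="id")

# Numbering rows within each key lets the join pair them one-to-one instead.
dup_left = dup_left.assign(_order=dup_left.groupby("id").cumcount())
dup_right = dup_right.assign(_order=dup_right.groupby("id").cumcount())
paired = dup_left.merge(dup_right, on=["id", "_order"])
```

On unique keys both strategies find the same shared, left-only, and right-only rows; they diverge once keys repeat, which is where the row fan-out shows up.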

I'm thinking we can "deprecate" it while leaving it around for backwards compatibility for the next little while. If folks want to continue to use it they can explicitly import it. Maybe we can rename it LegacySparkCompare or something. My main goal is to consolidate and clean up the package.
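
A minimal sketch of what the soft deprecation could look like, assuming the class is renamed to `LegacySparkCompare` as floated above. The constructor body and the warning text are placeholders, not the real implementation; the actual class would keep its existing Spark logic.

```python
import warnings


class LegacySparkCompare:
    """Hypothetical stand-in for the renamed native Spark comparison class."""

    def __init__(self, *args, **kwargs):
        # Warn on construction so existing callers keep working but are nudged
        # towards the other implementations. The wording is illustrative only.
        warnings.warn(
            "LegacySparkCompare is deprecated and may be removed in a future "
            "release; consider the Pandas on Spark or Fugue implementations.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Placeholder: the real class would run its existing setup here.
        self.args = args
        self.kwargs = kwargs
```

Using `stacklevel=2` points the warning at the caller's line rather than at the constructor, which makes it easier for users to find where they still depend on the legacy class.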

We have a Pandas on Spark implementation which mimics the Pandas logic much more closely, but it is obviously also doing a lot more than this version. Because of those differences, its performance lags a bit.

Reference for the Pandas on Spark implementation: #195