Gmousse/dataframe-js

[FEATURE] Merge dataframes with different columns

Opened this issue · 2 comments

Is your feature request related to a problem? Please describe.
I'd like to merge DataFrames with different columns.

Describe the solution you'd like
I'd like a df1.merge(df2) method that automatically merges two dataframes, even if a column is in df1 but not in df2, filling it with

Describe alternatives you've considered
Here is a snippet from @lmeyerov that I found (and completed) in issue #15, which does just what I want:

function unionDFs(a, b, fill='n/a') {
    // Merge two dataframes with different columns
    const aCols = a.listColumns(); // this line was missing from lmeyerov's original snippet
    const bCols = b.listColumns(); // this line was missing from lmeyerov's original snippet

    // Columns present in one dataframe but absent from the other
    const aNeeds = bCols.filter((v) => aCols.indexOf(v) === -1);
    const bNeeds = aCols.filter((v) => bCols.indexOf(v) === -1);

    const a2 = aNeeds.reduce((df, name) => df.withColumn(name, () => fill), a);
    const b2 = bNeeds.reduce((df, name) => df.withColumn(name, () => fill), b);

    return a2.union(b2);
}

Additional context
Current behaviour
[Image: screenshot, 2020-09-27 16:13:57]

What I'd like
[Image: screenshot, 2020-09-27 16:16:06]

A better implementation of the unionDFs snippet:

DataFrame.prototype.merge = function (df2, fill = null) {
    // Merge two dataframes with different columns
    const thisCols = this.listColumns();
    const otherCols = df2.listColumns();

    // Columns each side lacks
    const thisNeeds = otherCols.filter((v) => thisCols.indexOf(v) === -1);
    const otherNeeds = thisCols.filter((v) => otherCols.indexOf(v) === -1);

    // Add the missing columns to each side, filled with `fill`
    const thisPadded = thisNeeds.reduce((df, name) => df.withColumn(name, () => fill), this);
    const otherPadded = otherNeeds.reduce((df, name) => df.withColumn(name, () => fill), df2);

    // Rows of `this` come first, followed by the rows of df2
    return thisPadded.union(otherPadded);
};
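
The column-padding idea behind this snippet can be sketched without the library. The following is only an illustration on plain arrays of row objects, not on DataFrame instances; `mergeRows` is a hypothetical helper and is not part of dataframe-js.

```javascript
// Hypothetical helper: rows are plain objects, not dataframe-js DataFrames.
function mergeRows(rowsA, rowsB, fill = null) {
    const allRows = [...rowsA, ...rowsB];

    // Collect every column name in first-seen order (columns of rowsA first).
    const columns = [...new Set(allRows.flatMap((row) => Object.keys(row)))];

    // Rebuild a row over the full column list, filling the columns it lacks.
    const pad = (row) =>
        Object.fromEntries(
            columns.map((name) => [
                name,
                Object.prototype.hasOwnProperty.call(row, name) ? row[name] : fill,
            ])
        );

    return allRows.map(pad);
}

const left = [{ id: 1, name: 'a' }];
const right = [{ id: 2, score: 10 }];
const merged = mergeRows(left, right);

console.log(merged[0]); // { id: 1, name: 'a', score: null }
console.log(merged[1]); // { id: 2, name: null, score: 10 }
```

Passing a third argument, for example `mergeRows(left, right, 'n/a')`, changes the fill value in the same way as the `fill` parameter of the snippets above.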

This bug can be particularly insidious: if one dataframe's columns are a subset of the other's, the outcome depends on the order of the operands.

  • If you concatenate the df with fewer columns to the one with all columns, the union executes without issue.
  • If you concatenate the df with all columns to the one with fewer, the union fails.
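
An order-dependent result like this is what a one-directional containment test would produce. The sketch below is not the library's code; it assumes, for illustration only, a check that verifies the columns of one operand are contained in the other's, and compares it with a symmetric check. Both `isSubset` and `sameColumns` are hypothetical names.

```javascript
// Hypothetical one-directional check: true when every name in `a` is in `b`.
function isSubset(a, b) {
    const inB = new Set(b);
    return a.every((name) => inB.has(name));
}

// Symmetric check: both lists hold the same set of names, ignoring order.
function sameColumns(a, b) {
    return isSubset(a, b) && isSubset(b, a);
}

const fewer = ['id', 'name'];
const all = ['id', 'name', 'score'];

console.log(isSubset(fewer, all)); // true: passes in this direction
console.log(isSubset(all, fewer)); // false: fails in the other direction
console.log(sameColumns(fewer, all)); // false
console.log(sameColumns(all, fewer)); // false: consistent either way
```

With the symmetric check the two dataframes are rejected regardless of which one is the receiver, so the caller gets the same schema error in both cases instead of a silent success in one of them.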

This error is due to the use of an incorrect column comparison. It is still an issue in master:

export function arrayEqual(a, b, byOrder = false) {