tixxit/framian

Add joinBy method to Frame

tixxit opened this issue · 6 comments

In a few places it would be nice to have a joinBy method:

trait Frame[Row, Col] {
  ...
  def joinBy[A: Order: ClassTag](by: Cols[Col, A])(that: Frame[A, Col]): Frame[Row, Col]
  ...
}

I have two variations on the same approach. Both produce a frame with the original keys...however:

  • The first has to drop any keys from the "that" frame since type A != Row.
  • The second includes keys from both frames, but changes the method to use Row and Col types for all frames to do it.

Not sure how to have my cake and eat it too here:

  // Original version that only has keys from "this" frame:
  def joinBy1[A: Order: ClassTag](by: Cols[Col, A])(that: Frame[A, Col]): Frame[Row, Col] = {
    val reIdx = reindex(by)
    val thatReIdx = that.reindex(by)
    val joined = reIdx.merge(thatReIdx)(Merge.Outer)
    val (k, v) = joined.rowIndex.flatMap{ case (r, i) =>
      reIdx.rowIndex.getAll(r).indices.map(k => (rowIndex.keyAt(k), i))
    }.unzip
    joined.withRowIndex(Index.apply(k, v))
  }
  // Version where types match and keys from both are in result:
  def joinBy2(by: Cols[Col, Row])(that: Frame[Row, Col]): Frame[Row, Col] = {
    val reIdx = reindex(by)
    val thatReIdx = that.reindex(by)
    val joined = reIdx.merge(thatReIdx)(Merge.Outer)
    val (k, v) = joined.rowIndex.flatMap{ case (r, i) =>
      reIdx.rowIndex.getAll(r).indices.map(k => (rowIndex.keyAt(k), i)) ++
        thatReIdx.rowIndex.getAll(r).indices.map(k => (that.rowIndex.keyAt(k), i))
    }.unzip
    joined.withRowIndex(Index.apply(k, v))
  }

thoughts?

I prefer joinBy1. It would be nice if we had the option of Inner or Left (Right and Outer are Inner and Left because we drop the right keys). Perhaps, innerJoinBy(...)(...) and leftJoinBy(...)(...)? What do you think?

I like it. But does it make sense to be consistent with the other methods and take the Join as a param? Even though Right = Inner and Outer = Left they are all valid arguments right?

My only issue is that while they are technically valid, they would be confusing. I'd rather remove that point of confusion... However, I am not against making a sub-trait of Join that only works for Left/Inner and Right/Inner... so, something like this:

sealed trait Join {
  val leftOuter: Boolean
  val rightOuter: Boolean
}

sealed trait LeftBiasedJoin extends Join {
  val rightOuter: Boolean = false
}

sealed trait RightBiasedJoin extends Join {
  val leftOuter: Boolean = false
}

object Join {
  case object Inner extends LeftBiasedJoin with RightBiasedJoin
  case object Left extends LeftBiasedJoin { val leftOuter = true }
  case object Right extends RightBiasedJoin { val rightOuter = true }
  case object Outer extends Join { val leftOuter = true; val rightOuter = true }
}

And then we have:

def joinBy(by: Cols[Col, A])(that: Frame[_, Col])(join: LeftBiasedJoin): Frame[Row, Col]

Of course, the major issue here is we break binary compatibility, so it would have to be in the 0.4.0 release.

Probably not a hard breaking change to handle when released. I have this now:

  def joinBy[A: Order: ClassTag](by: Cols[Col, A])(that: Frame[A, Col])(join: LeftBiasedJoin): Frame[Row, Col] = {
    val reIdx = reindex(by)
    val joined = reIdx.join(that.reindex(by))(join)
    joined.withRowIndex(
      joined.rowIndex.flatMap{ case (r, i) =>
        reIdx.rowIndex.getAll(r).indices.map(k => (rowIndex.keyAt(k), i))
      }.unzip match {
        case (k, v) => Index.ordered(k, v)
      }
    )
  }

I can clean it up and add the PR if you like.

Please - just make the rows parameter on that a wildcard (Frame[_, Col]), since I don't think you meant to put the A from Cols[Col, A] there ;)