/sparkOptics

Optics for Spark DataFrames

Primary LanguageScalaApache License 2.0Apache-2.0

Build Status Maven Central Gitter Binder

Spark-optics

Modify your complex structures in spark-sql dataframes with optics.

Getting Started

Need to set an inner element in a complex structure?

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
val df: DataFrame = ???

import org.hablapps.sparkOptics._
df.select(Lens("field.subfield")(df.schema).set(lit(13)):_*)

Want to try it right now, click on the binder icon to lunch a interactive notebook.

Installing

Compiled for Scala 2.11 with Spark 2.3 and Scala 2.12 with Spark 2.4. Tested with spark 2.3, 2.4 in scala 2.11 and for 3.0.1 in scala 2.12.

libraryDependencies += "org.hablapps" %% "spark-optics" % "1.0.0"

Spark lens doesn't have any dependencies beyond Spark itself.

Implemented optics

Lens

Optic to focus in a column of a provided schema.

Protolens

Optic equivalent to a lens, but without an specific schema, so that it can be applied to any dataframe that contains the specified column.

Motivation and larger example

Working with complex structures in Spark Sql's DataFrames can be hard. A common case if you are working with complex structures is to modify the inner elements of the structure. For instance:

case class Street(number: Int, name: String)
case class Address(city: String, street: Street)
case class Company(name: String, address: Address)
case class Employee(name: String, company: Company)

val employee = Employee("john", Company("awesome inc", Address("london", Street(23, "high street"))))
val df = List(employee).toDS.toDF
root
 |-- name: string (nullable = true)
 |-- company: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- street: struct (nullable = true)
 |    |    |    |-- number: integer (nullable = false)
 |    |    |    |-- name: string (nullable = true)

In order to modify an inner element, like changing the name of the street, we need to do something like this:

val mDF = df.select(df("name"),struct(
   df("company.name").as("name"),
   struct(
     df("company.address.city").as("city"),
     struct(
       df("company.address.street.number").as("number"),
       upper(df("company.address.street.name")).as("name")
     ).as("street")
   ).as("address")
 ).as("company"))
mDF.printSchema
val longCodeEmployee = mDF.as[Employee].head
root
  |-- name: string (nullable = true)
  |-- company: struct (nullable = false)
  |    |-- name: string (nullable = true)
  |    |-- address: struct (nullable = false)
  |    |    |-- city: string (nullable = true)
  |    |    |-- street: struct (nullable = false)
  |    |    |    |-- number: integer (nullable = true)
  |    |    |    |-- name: string (nullable = true)
 
mDF: DataFrame = [name: string, company: struct<name: string, address: struct<city: string, street: struct<number: int, name: string>>>]
longCodeEmployee: Employee = Employee(
"john",
Company("awesome inc", Address("london", Street(23, "HIGH STREET"))))

This can be simplified by using spark-optics. It allows you to focus in the element that you want to modify, and the optics will recreate the structure for you.

import org.hablapps.sparkOptics._
import org.apache.spark.sql.functions._

val df: DataFrame = List(employee).toDF

val streetNameLens = Lens("company.address.street.name")(df.schema)
val modifiedDF = df.select(streetNameLens.modify(upper):_*)
modifiedDF.printSchema
modifiedDF.as[Employee].head
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
|    |-- name: string (nullable = true)
|    |-- address: struct (nullable = false)
|    |    |-- city: string (nullable = true)
|    |    |-- street: struct (nullable = false)
|    |    |    |-- number: integer (nullable = true)
|    |    |    |-- name: string (nullable = true)

streetNameLens: Lens = Lens(company.address.street.name)
modifiedDF: DataFrame = [name: string, company: struct<name: string, address: struct<city: string, street: struct<number: int, name: string>>>]
res19_3: Employee = Employee(
"john",
Company("awesome inc", Address("london", Street(23, "HIGH STREET")))
)