spark-dataset
Convenience loader methods for common datasets, for use in testing from both Spark applications and the REPL.
You can launch a Spark shell with the built jar files using the following command:
spark-shell --jars $(echo target/*/*.jar | tr ' ' ',')
The following example shows how to read and manipulate a supported dataset, in this case the Titanic dataset.
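import org.apache.spark.sql.SparkSession
import com.github.dongjinleekr.spark.dataset.Titanic._
// Create a session, or reuse the one the shell already provides.
val spark = SparkSession
  .builder()
  .appName("Spark Dataset Example")
  .getOrCreate()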
import spark.implicits._
import Titanic.implicits._
// Read dataset as DataFrame.
val df = spark.read
  .schema(Titanic.schema)
  .option("header", true)
  .csv("hdfs:///datasets/titanic/data.csv")
df.show(10)
// Convert the DataFrame to a typed Dataset.
val ds = df.as[Passenger]
ds.show(10)
ds.printSchema()
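Once the data is a typed Dataset, it can be manipulated with ordinary Scala functions. The following is a minimal sketch of that step, not part of this library: it uses a hypothetical SamplePassenger case class with assumed fields (name, pclass, age, survived), since the actual fields of Passenger may differ, and it builds a small in-memory Dataset with a local master instead of reading from HDFS. The object and method names are also illustrative.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type used only for this sketch; the fields of the
// library's own Passenger class may differ.
case class SamplePassenger(name: String, pclass: Int, age: Option[Double], survived: Boolean)

object TypedDatasetSketch {

  // Keep only passengers whose age is known and at least 18.
  def adults(ds: Dataset[SamplePassenger]): Dataset[SamplePassenger] =
    ds.filter(p => p.age.exists(_ >= 18.0))

  // Fraction of survivors per passenger class, collected to the driver.
  def survivalRateByClass(spark: SparkSession, ds: Dataset[SamplePassenger]): Map[Int, Double] = {
    import spark.implicits._ // encoders for Int and (Int, Double)
    ds.groupByKey(_.pclass)
      .mapGroups { (pclass, rows) =>
        val group = rows.toVector
        (pclass, group.count(_.survived).toDouble / group.size)
      }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Typed Dataset Sketch")
      .master("local[*]") // assumption: local run, no cluster needed
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the real dataset.
    val ds = Seq(
      SamplePassenger("Passenger A", 1, Some(38.0), survived = true),
      SamplePassenger("Passenger B", 3, Some(22.0), survived = false),
      SamplePassenger("Passenger C", 3, None, survived = true)
    ).toDS()

    adults(ds).show()
    survivalRateByClass(spark, ds).foreach { case (pclass, rate) =>
      println(s"class $pclass: survival rate $rate")
    }

    spark.stop()
  }
}
```

The same filter and groupByKey calls should apply to the ds value from the example above once the field names are swapped for those of the real Passenger class.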