Several use cases for experimenting with Apache Spark.
This project contains a small proof of concept using Apache Spark DataFrames.
It contains two tests:
- JoinUsingPlainRdd, a join of a table (a gzipped file of about 10 MB) with itself, using plain RDDs.
- JoinUsingDataFramesMain, the same join, but using DataFrames instead of RDDs. This uses spark-csv (a third-party library) to import the file with its schema.
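
As a rough illustration of the two approaches, here is a minimal sketch, not the project's actual code. It assumes Spark 1.4+ with spark-csv on the classpath. The file path, the two-column schema (`id`, `value`), and the object name `SelfJoinSketch` are hypothetical placeholders. The RDD variant splits lines on commas naively, so it would not handle quoted fields.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SelfJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("self-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: a gzipped CSV without header, first column is the join key.
    val path = "data/table.csv.gz"

    // Plain RDD variant: key every line by its first field, then join the RDD with itself.
    // textFile decompresses .gz files transparently.
    val keyed = sc.textFile(path)
      .map(_.split(",", -1))
      .map(fields => (fields(0), fields))
    val rddJoinCount = keyed.join(keyed).count()

    // DataFrame variant: load through spark-csv with an explicit (assumed) schema,
    // then self-join using aliases so both sides of the condition can be told apart.
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("value", StringType, nullable = true)
    ))
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .schema(schema)
      .load(path)
    val dfJoinCount = df.as("a")
      .join(df.as("b"), col("a.id") === col("b.id"))
      .count()

    println(s"RDD join rows: $rddJoinCount, DataFrame join rows: $dfJoinCount")
    sc.stop()
  }
}
```

Calling `count()` forces each join to execute, since Spark evaluates transformations lazily.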
Observations: The join using DataFrames runs in less than half the time of the join using RDDs.
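
One simple way to compare such run times is a small wall-clock helper. This is a hedged sketch in plain Scala; the names `TimingSketch` and `timed` are hypothetical and not part of this project. A single run like this is only indicative, since JVM warm-up and caching affect the numbers.

```scala
object TimingSketch {
  // Runs `block` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in practice the block would trigger a Spark action
    // such as count() on the joined RDD or DataFrame.
    val (sum, ms) = timed { (1L to 1000000L).sum }
    println(s"sum=$sum took $ms ms")
  }
}
```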
Come and see...