
Repository for the Scala Spark workshop held at universities by the Tenaris Data Science Department

MIT License

Scala Spark Workshop

This repository collects the Databricks notebooks used in the Scala Spark workshop held at universities by the Tenaris Data Science Department.


The repository contains two Databricks notebooks made for Databricks Community Edition. The aim is to teach Spark fundamentals to future software engineers.

One notebook contains exercises to be completed by students, while the other contains the solutions.

The notebooks are written in Italian and run on Spark 2.0+ clusters. The previous edition of the classes was based on Spark 1.6+; that code is still available on the spark_1.6.0 branch.
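
Before importing a notebook, it can help to confirm that the cluster meets the Spark 2.0+ requirement. The snippet below is a minimal sketch, not part of the workshop material: the helper names spark_major_minor and meets_workshop_requirement are hypothetical, and it assumes the version string starts with "major.minor" (for example "2.4.5"). In a Databricks notebook attached to a Spark 2.0+ cluster, the predefined spark session exposes this string as spark.version.

```python
def spark_major_minor(version: str) -> tuple:
    """Parse the leading 'major.minor' pair from a Spark version string."""
    # Assumes the string starts with two dot-separated integers, e.g. "2.4.5".
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def meets_workshop_requirement(version: str) -> bool:
    """Return True if the version satisfies the Spark 2.0+ requirement."""
    return spark_major_minor(version) >= (2, 0)


# In a notebook, pass spark.version instead of a literal string.
print(meets_workshop_requirement("2.4.5"))  # True
print(meets_workshop_requirement("1.6.0"))  # False
```

A cluster that fails this check would need the notebooks from the spark_1.6.0 branch instead.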

Getting Started

Workshop Scala Spark Edition: Students should create an account on Databricks Community Edition and import the notebook published at https://raw.githubusercontent.com/tenaris/scala-spark-workshop/master/src/main/databricks/EsercitazioneScalaSparkNoSoluzioni.dbc

Workshop PySpark Edition: Students should create an account on Databricks Community Edition and import the notebook published at https://github.com/tenaris/scala-spark-workshop/raw/master/src/main/databricks/WorkshopPySpark_English_NoSolution_Cleaned.dbc

Dataset References

  • The Iris Plants Database by R.A. Fisher, made available by Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
  • The Italian 2016 Referendum dataset, freely available on the Eligendo portal and licensed under IODL 2.0.