Experimenting with Databricks on the Titanic dataset

I have been following Databricks' technologies for a while without actually trying them. Built around Spark, they aim to unify data engineering and data analytics, and they include a few exciting tools that promise to simplify data exploration and modeling, as well as productionising a machine learning pipeline (Spark pipelines), tracking models (MLflow), and optimising performance (Delta Lake). I recently got the opportunity to start courses on how to use these technologies, and I thought it was time for me to give them a try.

To test them, I needed a dataset that I already knew and for which I could find resources. I therefore chose Kaggle's Titanic dataset and had a go at it with Databricks Community Edition.

gtregoat/Kaggle-s-titanic-with-databricks
