Distributed Data Mining project with PySpark, the Python API for Apache Spark
We will analyze the World Earthquake dataset, which can be downloaded from Kaggle at this link: https://www.kaggle.com/datasets/danielpe/earthquakes
To do so, we will use some of the tools learned during the "Distributed Data Analysis and Mining" course. The main tasks we will tackle are:
• Data Understanding
• Data Preparation & Regression
• Clustering
• Classification
We will approach these tasks in Python with PySpark, the Python interface to Apache Spark. PySpark lets you write Spark applications using Python APIs and also provides the PySpark shell for interactively analyzing data in a distributed environment. It supports most of Spark's features, such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core. Other libraries, such as Pandas, Matplotlib, Seaborn, Folium and Graphviz, are used only for visualisation. All computations and machine learning algorithms are performed solely with Spark.
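The division of labour described above (Spark for computation, Pandas and the plotting libraries only for visualisation) can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `get_spark`, `load_csv` and `sample_for_plot`, the application name, the local master setting and the 1000-row cap are all assumptions made for the example.

```python
def get_spark(app_name="earthquake-mining"):
    """Create (or reuse) a SparkSession.

    Assumption: a local master is used here for illustration; on a real
    cluster the master would normally come from the deployment settings.
    """
    # Imported lazily so this file can be loaded on machines without PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")
        .appName(app_name)
        .getOrCreate()
    )


def load_csv(spark, path):
    """Read a CSV file into a Spark DataFrame, inferring column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


def sample_for_plot(df, max_rows=1000):
    """Bring at most `max_rows` rows to the driver as a pandas DataFrame.

    Only this small result leaves Spark; it is meant for plotting with
    Matplotlib, Seaborn or Folium, not for further computation.
    """
    return df.limit(max_rows).toPandas()
```

A possible use, assuming the Kaggle file has been saved under a placeholder path such as `earthquakes.csv`: call `spark = get_spark()`, then `df = load_csv(spark, "earthquakes.csv")`, inspect it with `df.printSchema()` and `df.count()`, and pass `sample_for_plot(df)` to the plotting library of choice.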