Distributed-Data-Analysis-And-Mining

Project of Distributed Data Mining with PySpark, an Apache Spark API for Python

Primary language: Jupyter Notebook

We will analyze the World Earthquake dataset, which can be downloaded from Kaggle: https://www.kaggle.com/datasets/danielpe/earthquakes

To accomplish this, we will use some of the tools learned during the "Distributed Data Analysis and Mining" course. The main tasks are:

Data Understanding

Data Preparation & Regression

Clustering

Classification

We will approach these tasks in Python with PySpark, the Python interface to Apache Spark. PySpark lets you write Spark applications with Python APIs and also provides the PySpark shell for analyzing data interactively in a distributed environment. It supports most of Spark's features, including Spark SQL, DataFrame, Streaming, MLlib (machine learning), and Spark Core. Other libraries, such as Pandas, Matplotlib, Seaborn, Folium, and Graphviz, are used only for visualization. All computations and machine learning algorithms are performed solely in Spark.
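The split described above (Spark for all computation, other libraries only for plotting) can be sketched roughly as follows. This is a minimal illustration, not the code from the project notebooks: the file name `earthquakes.csv`, the column name `mag`, and the function names `magnitude_histogram` and `plot_magnitudes` are all assumptions made for the example.

```python
def magnitude_histogram(df, column="mag"):
    """Count events per whole-magnitude bin, computed entirely in Spark.

    `df` is expected to be a Spark DataFrame. The column name "mag" is a
    hypothetical placeholder and may differ in the actual dataset.
    """
    return (
        df.where(f"{column} IS NOT NULL")  # drop rows without a magnitude
        .selectExpr(f"CAST(floor({column}) AS INT) AS mag_bin")
        .groupBy("mag_bin")
        .count()  # distributed aggregation
        .orderBy("mag_bin")
    )


def plot_magnitudes(csv_path="earthquakes.csv", out_path="magnitudes.png"):
    """Aggregate with Spark, then plot the small result with pandas/Matplotlib.

    Both paths are hypothetical. Spark is imported here rather than at
    module level so the helper above can be loaded without a Spark install.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("earthquakes-sketch").getOrCreate()
    try:
        df = spark.read.csv(csv_path, header=True, inferSchema=True)
        # Only the aggregated rows (one per bin) are collected to the driver.
        pdf = magnitude_histogram(df).toPandas()
    finally:
        spark.stop()

    # Visualization only: pandas delegates the drawing to Matplotlib.
    ax = pdf.plot.bar(x="mag_bin", y="count", legend=False)
    ax.set_xlabel("Magnitude (floor)")
    ax.set_ylabel("Number of events")
    ax.figure.savefig(out_path)
    return pdf
```

Calling `plot_magnitudes()` with a real path would run the aggregation on the cluster and convert only the per-bin counts to a Pandas DataFrame, which keeps the amount of data moved to the driver small regardless of the dataset size.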