
A minimal machine learning application using PySpark


PySpark ML 'Hello World'

Image: Houses side by side, by ell brown (license: CC BY 2.0)

A simple repository that applies linear regression with PySpark to predict house prices.

I tried to avoid operations outside PySpark that require processing all the data. For example, to plot the features I processed the data in PySpark and used only the result in the graphs. This is more realistic for cases where the dataset is too large to plot every element.

Of course, it is entirely viable to use Pandas and Scikit-learn to create the regression model and make predictions; this repository is just a first step toward understanding the PySpark framework.

Jupyter Notebook: PySparkML

Dataset link: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset/data

Python Dependencies

$ pip install pyspark matplotlib seaborn numpy scipy
  • Remember to create a virtual environment first!

MIT License
This project is licensed under the MIT License - see the LICENSE file for details.

© 2023 BrenoAV