Image credit: ell brown, licensed under CC BY 2.0
A simple repository that applies Linear Regression with PySpark to predict house prices.
I avoided operations outside PySpark that need to process the whole dataset. For plotting, for example, the features are aggregated in PySpark and only the results are passed to the graphs. This is closer to a realistic setting, where the dataset is too large to plot every element.
Of course, it is perfectly viable to use Pandas and Scikit-learn to build the regression model and make predictions; this repository is just a first step toward understanding the PySpark framework.
Jupyter Notebook: PySparkML
Dataset link: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset/data
$ pip install pyspark matplotlib seaborn numpy scipy
- Remember to create a virtual environment first!
MIT License
This project is licensed under the MIT License - see the LICENSE file for details.
© 2023 BrenoAV