/spark-sklearn-airbnb-predict

Code example to predict prices of Airbnb vacation rentals, using scikit-learn on Spark with spark-sklearn, on MapR.

The Jupyter notebook in this repo contains examples that run regression estimators on the Inside Airbnb listings dataset for San Francisco. The target variable is the listing price. To speed up the hyperparameter search, the notebook also shows how to use the spark-sklearn package to distribute GridSearchCV across the nodes of a Spark cluster. Because the candidate parameter settings are fit in parallel, the search finishes sooner, and a larger grid can be explored in the same time, which can lead to better results.
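As a rough illustration of the kind of search described above, here is a minimal sketch of a grid search over a regression estimator. It is not the notebook's code: the data is synthetic, and the estimator and parameter grid are assumptions chosen only to keep the example small. It uses plain scikit-learn so it runs without a cluster; the comment at the end shows how the same search would be handed to spark-sklearn, whose GridSearchCV takes a SparkContext as its first argument.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# scikit-learn 0.18+; in 0.17 the class lives in sklearn.grid_search instead.
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for listing features and prices (assumption, not real data).
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 100 + 50 * X[:, 0] + 10 * rng.randn(200)

# Hypothetical grid: 2 x 2 = 4 candidate settings, each fit once per CV fold.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)

# Best parameter combination found by cross-validation.
print(search.best_params_)

# With spark-sklearn and an existing SparkContext `sc`, the fits are
# distributed across the cluster by swapping the import and passing `sc`:
#   from spark_sklearn import GridSearchCV
#   search = GridSearchCV(sc, RandomForestRegressor(random_state=0), param_grid)
```

Each of the four candidate settings is fit independently, which is why the work spreads well across executors.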

To run the scikit-learn examples (without Spark), the following packages are required:

  • Python 2
  • Pandas
  • NumPy
  • scikit-learn (0.17 or later)

These can be installed on the MapR Sandbox.

To run the scikit-learn examples with Spark, the following packages are required on each machine:

  • All of the above packages
  • Spark (1.5 or later)
  • spark-sklearn (follow the installation instructions in that project's documentation)

You can run this on a MapR cluster by following one of these methods:

Run the script with:

MASTER=yarn-client /opt/mapr/spark/spark-1.5.2/bin/spark-submit --num-executors=4 --executor-cores=8 python_scikit_airbnb.py

(setting num-executors and executor-cores to suit your environment)

The file classify.py in this repo contains an example of classification on the same dataset, using reviews.csv and text analysis.

and of course... have fun!