/DSCI631-project

Machine learning project to predict NYC property prices.


NYC Property Sales

by: Aviv Farag, Joseph Logan, Abdulaziz Alquzi


Abstract:

We are data science consultants contracted by property-management investors in New York City. Their company, backed by investors, wants to buy residential real estate in NYC as cheaply as possible, renovate it, and resell within a year. The renovation analysis is outside the scope of this project, but they want a baseline model that can predict the price of residential real estate in order to:

  1. Identify potentially undervalued listed properties to buy.
  2. Predict the market price when it is time to sell, so they can sell quickly while maximizing return on investment.

Because they want to renovate and resell the properties quickly, they restrict the search to properties with fewer than 10 residential units and sale prices of at least ten thousand dollars but less than 5 million dollars each.
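As a quick sketch, these constraints could be applied with a pandas boolean filter like the one below. The column names follow the Kaggle Rolling Sales dataset's conventions but are assumptions here; the notebook's actual preprocessing may differ.

```python
import pandas as pd

# Toy data standing in for the Rolling Sales dataset (column names assumed).
df = pd.DataFrame({
    "RESIDENTIAL UNITS": [2, 12, 1, 4],
    "SALE PRICE": [450_000, 2_500_000, 5_000, 6_000_000],
})

# Clients' criteria: fewer than 10 residential units,
# sale price at least $10,000 but under $5 million.
mask = (
    (df["RESIDENTIAL UNITS"] < 10)
    & (df["SALE PRICE"] >= 10_000)
    & (df["SALE PRICE"] < 5_000_000)
)
filtered = df[mask]
```

Only the first row of the toy data satisfies all three conditions.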


Python Packages:

  1. pandas
    import pandas as pd

  2. numpy
    import numpy as np

  3. matplotlib.pyplot
    import matplotlib.pyplot as plt

  4. joblib
    import joblib

  5. seaborn
    import seaborn as sns

  6. scipy.stats.randint
    from scipy.stats import randint

  7. sklearn:

    1. sklearn.metrics:

      1. mean_squared_error
      2. mean_absolute_error
      3. r2_score
      4. confusion_matrix
    2. sklearn.ensemble:

      1. RandomForestRegressor
      2. BaggingRegressor
    3. sklearn.model_selection:

      1. train_test_split
      2. GridSearchCV
      3. RandomizedSearchCV
      4. cross_validate
      5. KFold
    4. sklearn.preprocessing:

      1. StandardScaler
      2. OneHotEncoder
      3. RobustScaler
    5. sklearn.linear_model: LinearRegression

    6. sklearn.model_selection: train_test_split

    7. sklearn.pipeline: Pipeline

    8. sklearn.compose: ColumnTransformer

    9. sklearn.decomposition: PCA

    10. sklearn.dummy: DummyRegressor

    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import RobustScaler, StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from scipy.stats import randint
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.dummy import DummyRegressor
    from sklearn.tree import ExtraTreeRegressor
    from sklearn.model_selection import cross_validate, KFold
    

Functions

  1. random_SCV(pipe = [], grid_param = [], n_iter = 10, cv = 5, scoring = 'neg_mean_squared_error', rnd_state = 42, file_name = "", training = [])
    Runs RandomizedSearchCV on the estimator "pipe" according to grid_param and the other parameters, including a list of x_training and y_training (training). The results are saved in the param_tuning folder in the file named file_name.

  2. grid_SCV(pipe = [], grid_param = [], cv = 5, scoring = 'neg_mean_squared_error', file_name = "", training = [])
    Similar to random_SCV, but runs GridSearchCV on the estimator "pipe" instead.

  3. wr_pkl_file(file_name = "",content = "", read = False)
    Reads or writes a .pkl file that contains different machine learning pipelines along with their corresponding results.

  4. print_results(labels = [], est = [], plt_num = 50, log = False, testing = [])
    Predicts sale prices and prints the results (R-squared, MAE, and RMSE) for the different estimators (est).

  5. validation(models = [], estimators = [], training = [], cv = 5, train_score = False):
    Performs cross-validation for different models using their estimators and the training set.
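The sketch below shows, on synthetic data, the kind of workflow these helper functions wrap: a randomized hyperparameter search (as in random_SCV), persisting the result with joblib (as in wr_pkl_file), computing R-squared, MAE, and RMSE (as in print_results), and cross-validating (as in validation). The parameter ranges, file path, and data are illustrative assumptions, not the project's actual values.

```python
import os
import tempfile
import numpy as np
import joblib
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, KFold, cross_validate
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Synthetic regression data standing in for the training set.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Randomized hyperparameter search (what random_SCV wraps).
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={"n_estimators": randint(10, 50),
                         "max_depth": randint(2, 8)},
    n_iter=5, cv=3, scoring="neg_mean_squared_error", random_state=42,
)
search.fit(X, y)

# Persist and reload the best pipeline (what wr_pkl_file wraps).
path = os.path.join(tempfile.gettempdir(), "best_model.pkl")
joblib.dump(search.best_estimator_, path)
best = joblib.load(path)

# Score the model (what print_results reports); here scored on the
# training data only for brevity.
preds = best.predict(X)
r2 = r2_score(y, preds)
mae = mean_absolute_error(y, preds)
rmse = mean_squared_error(y, preds) ** 0.5

# K-fold cross-validation (what validation wraps).
cv_res = cross_validate(
    best, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",
)
```

In the actual notebook the search results are written to the param_tuning folder and the metrics are computed on a held-out test split rather than the training data.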


Setup and running the code:

Clone the repo using the following command in terminal:
git clone https://github.com/avivfaraj/DSCI631-project.git

After cloning the repo, open Final_project.ipynb and run each cell one at a time in the order that they are presented. You can run the whole notebook in a single step by clicking on the menu Cell -> Run All.

The first two sections define the packages and functions required for the code to run. Make sure to run those two sections before running the rest of the program.


Acknowledgements

The dataset was found on Kaggle.
The data originates from the NYC Department of Finance Rolling Sales files.