This project was created by Amandine Allmang, Nicolas Bourriez, Marie Bouvard, Jules Giraud, and Arthur Nardone as part of the Ensemble Learning course of the MSc AI at CentraleSupelec.
The goal of this project was to predict AirBnB prices in New York using ensemble learning methods. The code is written in Python and uses popular machine learning libraries such as Scikit-learn and XGBoost.
The dataset used in this project is the New York City Airbnb Open Data dataset, which contains detailed information on Airbnb listings in New York City, including price, location, room type, and availability.
To run this code, you need to have Python 3 and the following Python libraries installed:
- pandas
- numpy
- matplotlib
- plotly
- seaborn
- scikit-learn
- xgboost
To install these libraries, you can use pip by running:
pip install pandas numpy matplotlib seaborn scikit-learn xgboost plotly
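If you want to confirm that the installation worked before opening the notebooks, a small check like the one below can help. It is only a sketch: the helper name `missing_packages` is made up for this example and is not part of the project, and the list uses import names, so scikit-learn appears as `sklearn`.

```python
import importlib.util

# Import names of the libraries listed above.
# Note: scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "xgboost", "plotly"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name printed as missing can then be installed with pip as shown above.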
You can run the code by opening the Jupyter notebooks AirBnB_boosting_models.ipynb and AirBnB_bagging_models.ipynb and executing the cells. The notebooks train the various models and visualize the results. They rely on the helper functions defined in the Python (.py) files.
The code consists of the following files:
- In the Data_Visualization folder, containing Data_Analysis_Price_vs_Rest.ipynb and Uni_Variable_Data_Exploration.ipynb, we explored the data and looked at the distribution of samples for each feature as well as the interaction of each feature with the price of the listing.
- In utils.py, we take care of data cleaning, preprocessing and evaluation. In this step, we load the dataset, handle missing values, and perform some feature engineering to prepare the data for modeling.
- In bagging.py and boosting.py, we can access functions that create bagging and boosting models respectively.
- evaluation_plots.py contains the functions that allow us to visualize our results.
- AirBnB_bagging_models.ipynb and AirBnB_boosting_models.ipynb can be run to train your own Decision Trees, Random Forest, Extremely Randomized Trees, Gradient Boosting, XGBoost and Adaboost models. They currently contain our best results and corresponding plots.
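To give an idea of what training and comparing these model families looks like, here is a minimal sketch using scikit-learn only. It does not reproduce the project's helpers in bagging.py, boosting.py or utils.py: the data is synthetic (generated with `make_regression` as a stand-in for the preprocessed listings), the `evaluate` function is hypothetical, the hyperparameters are arbitrary, and XGBoost is left out to keep the example dependency-light.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed listing features and prices.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One single tree, two bagging-style ensembles, two boosting ensembles.
models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=0),
}


def evaluate(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its RMSE on the held-out test set."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        scores[name] = float(np.sqrt(mse))
    return scores


scores = evaluate(models, X_train, X_test, y_train, y_test)
for name, rmse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
```

The ranking on synthetic data says nothing about the Airbnb results; the notebooks remain the reference for those.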
For the CART implementation, the datasets used are Iris for the classification task and Diabetes for the regression task. Both come from scikit-learn's standard datasets.
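Both datasets ship with scikit-learn, so no download is needed. As a quick illustration (the variable names are arbitrary and not taken from the notebook), they can be loaded like this:

```python
from sklearn.datasets import load_diabetes, load_iris

# Iris: 150 samples, 4 features, 3 classes (classification).
X_cls, y_cls = load_iris(return_X_y=True)

# Diabetes: 442 samples, 10 features, continuous target (regression).
X_reg, y_reg = load_diabetes(return_X_y=True)

print("Iris:", X_cls.shape, "classes:", sorted(set(y_cls.tolist())))
print("Diabetes:", X_reg.shape)
```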
To run this code, you need to have Python 3 and the following Python libraries installed:
- pandas
- numpy
- scikit-learn
To install these libraries, you can use pip by running:
pip install pandas numpy scikit-learn
You can run the code by opening the Jupyter notebook cart_test.ipynb and executing the cells. The notebook lets you test our regression and classification tree implementations. It relies on the helper classes and methods defined in the Python (.py) files.
The code consists of the following files:
- The CART folder, containing a src folder with a classes folder inside.
- Inside the classes folder, you can find base_tree.py and node.py.
- Inside the classes folder, you can find classification_tree.py and regression_tree.py.
- There is also a notebook called cart_test.ipynb.
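For readers unfamiliar with CART, the sketch below shows the general idea behind a node structure and a classification tree grown with the Gini criterion. It is not the code from node.py, base_tree.py or classification_tree.py: every name here (`Node`, `gini`, `best_split`, `build`, `predict_one`) is hypothetical, and the exhaustive threshold search is the simplest possible variant.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Node:
    """Hypothetical tree node: leaves keep only a prediction, internal nodes also a split."""
    prediction: int
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def gini(y):
    """Gini impurity of an integer label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))


def best_split(X, y):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(y)
    for feature in range(X.shape[1]):
        # Skip the largest value so that both sides of the split are non-empty.
        for threshold in np.unique(X[:, feature])[:-1]:
            mask = X[:, feature] <= threshold
            score = (
                mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])
            ) / len(y)
            if score < best_score:
                best, best_score = (feature, float(threshold)), score
    return best


def build(X, y, depth=0, max_depth=3):
    """Recursively grow a classification tree up to max_depth."""
    node = Node(prediction=int(np.bincount(y).argmax()))
    split = best_split(X, y) if depth < max_depth else None
    if split is not None:
        node.feature, node.threshold = split
        mask = X[:, node.feature] <= node.threshold
        node.left = build(X[mask], y[mask], depth + 1, max_depth)
        node.right = build(X[~mask], y[~mask], depth + 1, max_depth)
    return node


def predict_one(node, x):
    """Walk from the root to a leaf and return the leaf's majority class."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Tiny toy example: one feature, classes separated around 2.5.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 1, 1])
tree = build(X_toy, y_toy)
print(predict_one(tree, np.array([1.5])), predict_one(tree, np.array([3.5])))
```

A regression tree follows the same recursion, with variance (or mean squared error) in place of Gini impurity and the mean target as the leaf prediction.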
Contributions to the codebase are welcome. If you find any issues or have ideas for improvement, feel free to open an issue or pull request.