The goal of this project is to create price-prediction models for Airbnb accommodations located in Oslo, Norway.
The underlying data is provided by the Inside Airbnb Project and freely available on the web. This repository was created in context of the Statistical and Deep Learning seminar offered by the Chair of Statistics at the University of Göttingen during Winter Term 2021/22. Thus, the project focuses on the application of modern Machine Learning and Deep Learning methods that we implemented via the popular Python libraries scikit-learn and pytorch.
The results of this project including visualizations and interpretations of our findings are presented in written form in the term-paper/paper.pdf
file with corresponding slides in the term-paper/presentation.pdf
file.
In order to use the airbnb_oslo
package you can clone the repository from GitHub and install all required dependencies with
pip install .
If you want to contribute to development please run the command
pip install -e .[dev]
to install the airbnb_oslo
package in editable mode with all development dependencies.
Finally, in order to recreate the data sets in the data
folder, you can install the full set of dependencies with
pip install -e .[dev, data]
Remarks:
-
The two notebooks
cnn_colab.ipynb
andcnn_pretrained_colab.ipynb
in thenotebooks/colab
folder require GPU computations and are thus run on Google Colab. To store the results from these notebooks, we connected Google Colab with our individual Google Drive. Hence, when reproducing our findings, file paths need to be adjusted according to your configuration. -
The two notebooks mentioned above as well as the script
airbnb_oslo/models/cnn_visualization.py
file depend on the contents ofdata/clean/front_page_responses.pkl
which is not tracked by Git due to its large file size. Thus, you need to first run the scriptairbnb_oslo/data/scrape_pictures.py
to generate the data locally.
The project contains 4 relevant folders:
-
The
airbnb_oslo
package contains the core package functionality with all executable Python scripts.- The files in the
data
subfolder generate processed data files hat are stored in the top-leveldata
folder. - The
helpers
subfolder provides building block functions and classes for constructing the model architecture, fitting the models and extracting the results. These core components are used in multiple scripts inside themodels
subfolder. - Files in the
models
subfolder use the preprocessed data to constructscikit-learn
andpytorch
models. All results and visualizations of the term paper are produced here.
- The files in the
-
The
data
folder consists of raw and processed data files as well aspickle
files that store model outputs and model weights.After applying our models to the data for Oslo, one component of the course was to analyze the generalization performance to a new data set. For this task the Airbnb data for Munich was chosen. Hence the
data/munich
subfolder gathers equivalent raw and processed data files for Munich that are only relevant for this new prediction task and thus completely independent from our original analysis for Oslo. -
The
notebooks
folder contains all Jupyter Notebooks that show some of our models in action. These were majorly used during the development stage for exploratory purposes and to present intermediary findings to guide subsequent development steps. -
Finally, the
term-paper
folder contains all files that are immediately connected to the final term paper as well as the presentation. This folder is structured into achapters
subfolder, which collects all contents for the term paper that are later imported into the mainpaper.tex
file, as well as self-explainingimages
andtables
subfolders.