API Real-Estate

Project Guidelines

Repository: challenge-api-deployment
Type of Challenge: Learning
Duration: 5 days
Deadline: 30/07/2021 16:00 (code)
Presentation: 02/08/2021 10:00
Team challenge : Solo

Technologies / Libraries

Python : A programming language
Numpy : The fundamental package for scientific computing with Python
Scikit-Learn : Machine Learning Library
Pandas : A fast, powerful, flexible and easy to use open source data analysis and manipulation tool
FastAPI : A modern, fast (high-performance) web framework for building APIs with Python 3.6+

Documentation

Objectives

This API can be used to get a predicted price for a given (hypothetical) property: it can be a house, an apartment, but also a farmhouse, etc. (many types are allowed by the API). In order to make a prediction, the API runs a model (more information on that in the last section) that needs some real-estate data.

API Framework

I used FastAPI rather than Flask because its integration with Pydantic models is simply amazing. Almost all the validation takes place within the model (i.e. class) definitions: it's OOP, it's easy, readable and customizable.

In other words, no code is needed to manually check the structure of the data, because the Pydantic library does that automatically and returns nicely formatted, explicit error messages. For example, instead of receiving a head-scratching '500 Internal Server Error' when you forget to send a required field, FastAPI and Pydantic automatically send a '422 Unprocessable Entity' validation error with a message saying exactly which field is missing!
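As a rough illustration of the validation described above, here is a minimal sketch using Pydantic alone. The model name `PropertyInput`, its fields and the helper `missing_fields` are hypothetical and are not taken from this API; the real schema is the one shown in /docs.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class PropertyInput(BaseModel):
    # Hypothetical fields, for illustration only (not the API's real schema).
    area: int                      # required
    postal_code: int               # required
    garden: Optional[bool] = None  # optional


def missing_fields(payload: dict) -> list:
    """Return the names of required fields that are absent from payload."""
    try:
        PropertyInput(**payload)
    except ValidationError as exc:
        # Each error carries the field location and an error type;
        # the type contains "missing" when a required field was not sent.
        return [err["loc"][0] for err in exc.errors() if "missing" in err["type"]]
    return []
```

When a model like this is used as the type of an endpoint parameter, FastAPI runs the same validation on the request body and turns the error list into the 422 response.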

And although it did not matter here, FastAPI can be used asynchronously, with performance that its documentation describes as on par with Node.js - which is great for people like me who don't know anything about JavaScript! If you are interested, go check the official FastAPI documentation.

Model Building

Data collection

Data about real estate in Belgium was scraped from Immoweb around June 2021 with the BeautifulSoup and Selenium libraries. The first one allows us to retrieve (static) information contained within the HTML code, while the second one provides automated interaction with the website in order to click, scroll, wait for some JavaScript to be executed, fill in fields, etc. Collected data includes information about the size, the condition of the building, the postal code, the type of building, etc. Eventually only the features relevant to the model were kept; they are shown in the /docs and /redoc pages.

Part of the reason why predictions are not super accurate (as discussed below) is that collecting this Immoweb data presented some challenges. First, many properties were missing some fields that had some predictive value (or at the very least, we were not able to retrieve them from the page's code). Second, we did not know about regression and Machine Learning at that time; it's plausible that if we were to do it all over again, we would focus more on getting specific fields that, in hindsight, could help train the model.

Exploratory Data Analysis (EDA)

The first step - called EDA for Exploratory Data Analysis - was to find out which features seemed to be most correlated with the price, and how much outliers affected the variance ("variability") in the relations between variables. A business-oriented presentation was given at this stage; if you are interested, you can find the slides here.

The second step was to further clean the data: properties with too many missing fields were dropped and outliers were removed (about 0.1%).

The third step - called feature engineering - was to turn categorical features into numerical ones, to create a 'province' feature based on the postal codes (because it quickly appeared that clustering the data at the province level would yield better predictions), and finally to rescale the target (i.e. the price) with a logarithm to avoid too much "dispersion" at high prices.
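To make the feature-engineering step more concrete, here is a minimal sketch of two of its parts: deriving a province from a postal code and moving the price to a log scale. The ranges below are the usual Belgian postal-code ranges; the function names are hypothetical and the exact mapping and logarithm base used in this project are not shown in this README, so treat both as assumptions.

```python
import math

# Belgian postal-code ranges (inclusive) and the province or region they fall in.
PROVINCE_RANGES = [
    (1000, 1299, "Brussels"),
    (1300, 1499, "Walloon Brabant"),
    (1500, 1999, "Flemish Brabant"),
    (2000, 2999, "Antwerp"),
    (3000, 3499, "Flemish Brabant"),
    (3500, 3999, "Limburg"),
    (4000, 4999, "Liège"),
    (5000, 5999, "Namur"),
    (6000, 6599, "Hainaut"),
    (6600, 6999, "Luxembourg"),
    (7000, 7999, "Hainaut"),
    (8000, 8999, "West Flanders"),
    (9000, 9999, "East Flanders"),
]


def province_from_postal_code(postal_code: int) -> str:
    """Look up the province (or Brussels region) for a Belgian postal code."""
    for low, high, name in PROVINCE_RANGES:
        if low <= postal_code <= high:
            return name
    raise ValueError(f"Not a Belgian postal code: {postal_code}")


def log_price(price: float) -> float:
    """Rescale the target: natural log of the price (assumed base)."""
    return math.log(price)


def price_from_log(log_value: float) -> float:
    """Invert the rescaling to get back a price."""
    return math.exp(log_value)
```

Because the model is trained on the log of the price, any prediction has to go through the inverse transform before it is returned as a price.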

Finally, after trying a few linear regressions, a non-linear estimator called a Random Forest Regressor was selected and trained.
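For readers who want to see what training on a log-scaled target can look like, here is a minimal sketch. It assumes scikit-learn's RandomForestRegressor (the random forest variant for a continuous target such as a price), uses synthetic stand-in data rather than the Immoweb dataset, and picks arbitrary hyperparameters; none of this is the project's actual training code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the Immoweb dataset): a living area in square
# metres and an integer province code, with a price that depends on both.
rng = np.random.default_rng(0)
area = rng.uniform(40, 300, size=200)
province = rng.integers(0, 11, size=200)
X = np.column_stack([area, province])
price = 2000 * area * (1 + 0.05 * province) * rng.lognormal(0, 0.1, size=200)

# Fit on the log of the price, as described in the feature-engineering step.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, np.log(price))

# Predictions come out on the log scale, so exponentiate to get prices back.
predicted_price = np.exp(model.predict(X[:5]))
```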

For more information on what it means to "train a model", you can check this video. If you don't enjoy the author's singing, you can check this other video as well.

For more information on Random Forest in general, you can check this article.

How well is it working?

Unfortunately, not that well right now... It can be pretty spot on, but it can also be off by more than half a million for the most expensive or peculiar properties. Usually though, the predicted price is in an acceptable range and makes sense. On a more technical level, the adjusted R² score is about 0.70.
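For reference, the adjusted R² mentioned above is usually derived from the plain R² by penalizing the number of features. A minimal sketch of the standard formula follows; the function name is hypothetical, and the sample and feature counts used for this model are not given in this README.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Standard adjusted R²: 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_samples - n_features - 1
    if degrees_of_freedom <= 0:
        raise ValueError("n_samples must be greater than n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / degrees_of_freedom
```

With many samples and few features, the adjusted score stays close to the plain R²; the penalty only becomes noticeable when the number of features is large relative to the number of samples.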

Usage

You can use the endpoint /predict/ (with a POST request) to send a properly formatted JSON object with all the required fields and their respective values (and yes, those are case sensitive). To help you figure out what the API expects from you, you can check 2 super important links.

The first one allows you to check all the acceptable values for all the required fields. Again, the string values are case-sensitive!

The second one allows you to try the API by directly typing some values in a JSON object and see what response and potential error message you get with the values you've entered. It's completely safe, don't hesitate to try different combinations.
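If you would rather call the API from a script, a POST request to /predict/ can be sketched with the Python standard library as below. The base URL, the helper name `build_predict_request` and the field names in the payload are placeholders and assumptions; use the deployed address and the fields listed in the documentation pages instead.

```python
import json
import urllib.request


def build_predict_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the /predict/ endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and hypothetical field names, for illustration only.
example = build_predict_request(
    "http://localhost:8000",
    {"area": 120, "postal_code": 1000},
)

# To actually send it and read the JSON response:
# with urllib.request.urlopen(example) as response:
#     print(json.load(response))
```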

CorentinChanet/challenge-api-deployment