bot-scrape-zimmo

A scraping bot written in Python that collects data from the real estate website Zimmo.be, for a challenge given by BeCode.


Challenge: Collecting data

by Jean-Christophe Meunier, Noah Alvarez Gonzalez & Joffrey Bienvenu.

Challenge summary

The mission: collect as much data as possible about the market price of real estate in Belgium, in order to build a dataset that can later be used to create an AI.

Constraints:

  • Get data from all over Belgium.
  • Deliver a .CSV file with a minimum of 10,000 entries.
  • No empty fields.
  • No duplicates.
  • Always record numerical values if possible.

Objective:

Create a program capable of scraping one (or more?) real estate websites while respecting all the constraints.

The target: Zimmo.be - Why?

We chose to scrape zimmo.be for the following reasons:

  • It contains more than 100,000 real estate advertisements in Belgium.
    So, we can get data from all over Belgium without having to scrape multiple local agencies' websites.

  • This website is easy to scrape:

    • No JavaScript: All the data is available in the HTML code of the site as soon as the page is loaded. There is no JavaScript that delays the loading of this data, and there is no button we have to click on to access it.
      So, it is possible to scrape this website without Selenium (no web browser), simply with web requests; this increases the speed of our program!

    • Well-structured HTML: The data is encapsulated in clear HTML tags, with identical tags and attributes each time a page loads.
      So we can just use the simplest extraction methods of BeautifulSoup.

    • A well-structured website: The entry point for our scraper is this page: https://www.zimmo.be/fr/province/. The URL alone tells you what it contains: this page groups all the real estate sale offers of the website, classified by region.
      The offer links are very clear too: https://www.zimmo.be/fr/borgerhout-2140/a-vendre/maison/JP4OF/ contains the city, the postal code, the type of offer (for sale / for rent), the type of property (house / apartment), and a unique identification code.
      It is therefore possible to apply a filter directly on the URLs, preventing our program from browsing unnecessary pages (see the sketch after this list).

    • Bot friendly: The site is poorly protected against bots, and handles the large number of requests sent by our program well. For the 20,000+ pages we scraped, we had to solve a total of 4 captchas.
      We did not have to implement a strong anti-ban strategy.
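
To give an idea of what this looks like in practice, here is a minimal sketch (not our actual Scrapper code) of fetching the province page with plain web requests and filtering the offer links by URL. It assumes the requests and beautifulsoup4 packages; the regular expression is an illustrative guess at the URL scheme described above, not the exact filter we use.

  import re
  import requests
  from bs4 import BeautifulSoup

  # Offer URLs look like /fr/<city>-<postal code>/a-vendre/<property type>/<id>/,
  # so a single regular expression keeps only the houses and apartments for sale.
  OFFER_PATTERN = re.compile(r"/fr/[\w-]+-\d{4}/a-vendre/(maison|appartement)/\w+/")

  response = requests.get("https://www.zimmo.be/fr/province/", timeout=30)
  soup = BeautifulSoup(response.text, "html.parser")

  # Collect every link on the page and keep only those matching an offer URL.
  offer_urls = {a["href"] for a in soup.find_all("a", href=True)
                if OFFER_PATTERN.search(a["href"])}
  print(f"Found {len(offer_urls)} offer URLs")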

The program: A scraping bot

What does our program do?

Based on the challenge's constraints, we wanted a program capable of:

  • Scraping zimmo.be.
  • Scraping other websites later.
  • Working with captchas.
  • Cleaning up the data and completing missing data.
  • Delivering a CSV that meets the customer's specifications.
  • Backing up data in case of a crash.

Integrated concepts:

In order to practice, we tried to integrate the concepts seen during the last two weeks into the program, such as:

  • Object-oriented programming.
  • Threading.
  • Scraping with web requests/Selenium and BeautifulSoup.
  • Regular Expressions.
  • Typing.
  • Data manipulation with Pandas' Dataframes.
  • File creation and crash recovery.
  • Decorators (This goal was not achieved).

A picture is worth a thousand words:

Here is the architecture of our program:

  [Program architecture diagram - legend: Module, Object, Threaded Object]

How does it work?

Our program is divided into three modules:

  • A Scrapper (for Zimmo.be), in charge of scraping the data.
  • A Cleaner, which cleans the data and makes backups.
  • A Merger, which transforms all backup files into a .CSV.

The Data Collector object coordinates the instantiation of the modules and the transmission of data between them.
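
As an illustration of this coordination, here is a heavily simplified, self-contained sketch. The class and method names are stand-ins, and the Manager, Cleaner and Saver bodies are stubs, not the real implementations.

  import pandas as pd

  class Manager:
      """Stub standing in for the UrlGrabber + Scrapper pair."""
      def scrape(self):
          yield [{"locality": "Borgerhout", "price": 250000}]  # one fake raw batch

  class Saver:
      def __init__(self):
          self.batches = []
      def save(self, df):
          self.batches.append(df)  # the real Saver pickles each batch to ./backup/

  class Cleaner:
      def __init__(self, saver):
          self.saver = saver
      def clean(self, raw_batch):
          self.saver.save(pd.DataFrame(raw_batch))  # the real Cleaner also normalizes fields

  class DataCollector:
      """Coordinates the pipeline: Manager -> Cleaner -> Saver, then a final merge."""
      def run(self):
          saver = Saver()
          cleaner = Cleaner(saver)
          for raw_batch in Manager().scrape():
              cleaner.clean(raw_batch)
          return pd.concat(saver.batches, ignore_index=True)

  print(DataCollector().run())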

Why this architecture?

The two strong points of this architecture are:

  • Being able to scrape any site we want: The Zimmo.be Scrapper is an interchangeable module.

    To scrape another website, we just need to implement another module (for example, an Immoweb scraping module) and connect it to the rest of the program. There is no need to write a new program from scratch, nor to modify the Cleaner or the Merger.

  • Being able to deliver one .CSV in the desired format to the customer: Some customers want numeric values everywhere, others prefer True/False values as strings, ... Our program can do that easily!

    To be able to deliver a .CSV formatted on demand without having to scrape the whole website again, our program saves the scraped data as pickle files in a "backup" folder. It backs the data up as it retrieves it from the site. Once all the data is recovered, the Merger transforms it into a .CSV.

    Advantage: This architecture also spares us from having to restart scraping from the beginning if the program crashes (internet connection down? Ban?). As all data is saved in files, we can simply resume from where the program stopped (see the sketch below).
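
A minimal sketch of the backup and resume idea, assuming pandas, backup files named batch_<n>.pkl and a "url" column; the real Saver may use different names.

  import glob
  import os
  import pandas as pd

  BACKUP_DIR = "./backup/"

  def save_batch(df: pd.DataFrame, batch_id: int) -> None:
      """Pickle one cleaned batch as soon as it has been scraped."""
      os.makedirs(BACKUP_DIR, exist_ok=True)
      df.to_pickle(os.path.join(BACKUP_DIR, f"batch_{batch_id}.pkl"))

  def already_scraped_urls() -> set:
      """On restart, rebuild the set of URLs already backed up so we can skip them."""
      urls = set()
      for path in glob.glob(os.path.join(BACKUP_DIR, "*.pkl")):
          urls.update(pd.read_pickle(path)["url"])
      return urls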

All object definitions:

  • Data Collector: It instantiates the Manager, collects the raw data from the Manager, and sends it to the Cleaner. When the scraping is complete, it instantiates a Merger.

  • Manager: It connects the UrlGrabber and the Scrapper to the Collector.

  • UrlGrabber: It retrieves all real estate advertisement URLs from zimmo.be.

  • Scrapper: Once the UrlGrabber's job is complete, it scrapes the data of each given URL.

  • WebDriver: It initializes a custom version of the Selenium WebDriver, with a proxy, JavaScript and images disabled, and AdBlock activated (see the sketch after this list).

  • Requester: It initializes a custom version of Request. Currently, this class is not used.

  • Cleaner: It cleans and normalizes the raw data sent by the Collector. The data is then put in a DataFrame and sent to the Saver.

  • Saver: It saves a given DataFrame to a pickle file in the ./backup/ folder.

  • Merger: It retrieves all pickle files from the ./backup/ folder. Then, depending on the client's needs, it applies some filters and saves everything in a .CSV file.
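
For reference, here is roughly how such a WebDriver can be configured with Chrome. The proxy address, extension path and preference values are assumptions for illustration, not the exact settings of our WebDriver class.

  from selenium import webdriver

  options = webdriver.ChromeOptions()
  options.add_argument("--proxy-server=http://my-proxy:8080")  # hypothetical proxy
  options.add_extension("adblock.crx")                         # path to an AdBlock extension
  options.add_experimental_option("prefs", {
      # 2 = block: disabling images and JavaScript makes page loads much lighter
      "profile.managed_default_content_settings.images": 2,
      "profile.managed_default_content_settings.javascript": 2,
  })

  driver = webdriver.Chrome(options=options)
  driver.get("https://www.zimmo.be/fr/province/")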

How to deal with captchas?

Currently, captchas are the reason why we are not scraping zimmo.be through Request. Here is how we deal with them:

  1. The WebDriver is set to wait up to 24 hours for a human intervention when a captcha appears (see the sketch below).
  2. The human solves the captcha.
  3. The program continues scraping.
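
A sketch of that waiting logic with Selenium's WebDriverWait; the CSS selector used to spot the captcha is an assumption, not the element zimmo.be actually serves.

  from selenium.webdriver.common.by import By
  from selenium.webdriver.support.ui import WebDriverWait

  CAPTCHA_TIMEOUT = 24 * 60 * 60  # wait up to 24 hours for a human to intervene

  def wait_until_captcha_solved(driver) -> None:
      """Block until no captcha iframe is left on the page."""
      WebDriverWait(driver, CAPTCHA_TIMEOUT, poll_frequency=10).until(
          lambda d: not d.find_elements(By.CSS_SELECTOR, "iframe[src*='captcha']")
      )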

TO DO - Future improvements:

  • Scraping Zimmo.be through Request only (no more Selenium): To speed up the scraping, we need to replace Selenium with Request. But we need to implement something that detects when a request hits a captcha and opens the webpage through Selenium for us to solve the captcha (see the sketch after this list).

  • Cleaning and refactoring the code: The short deadline to deliver this program left some parts of the code a bit dirty. A good refactor is necessary before implementing another scraping module.

  • Further pre-processing of the data: Some advertisements have missing fields. The missing data could be scraped from the text written by the seller, but this requires a more complicated implementation of the Scrapper.

  • Further post-processing of the data: Implementing a new scraping module for a new website (like Immoweb) could lead to a rewrite of the Cleaner.
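
A possible shape for that hybrid approach (a sketch, not something implemented yet): the "captcha" keyword check below is a naive placeholder for real detection.

  import requests

  def fetch(url: str, driver) -> str:
      """Try a plain request first; fall back to Selenium when a captcha is suspected."""
      response = requests.get(url, timeout=30)
      if response.ok and "captcha" not in response.text.lower():
          return response.text
      # Suspected captcha or block: open the page in the browser so a human
      # can solve it, then reuse the HTML rendered by Selenium.
      driver.get(url)
      wait_until_captcha_solved(driver)  # see the captcha sketch above
      return driver.page_source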

The CSV:

We managed to scrape 21,825 sale offers from zimmo.be. See the CSV.

CSV structure:

  • locality: str
  • type_of_property: str = "house" | "appartment"
  • subtype of property: str
  • price: int
  • sale_type: str = "agency" | "notarial"
  • num_rooms: int
  • area: int
  • kitchen_equipment: str = "none" | "equipped"
  • furnished: bool
  • open_fire: bool
  • terrace: bool
  • terrace surface: int
  • garden: bool
  • garden surface: int
  • surface_land: int
  • surface_plot_land: int
  • number_of_facades: int
  • swimming_pool: str = "yes" | "no"
  • state_building: str = "new" | "to be renovated"
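
To illustrate how the Merger can deliver different formats on demand, here is a simplified sketch (not the actual Merger code). The boolean column list and the numeric_booleans switch are assumptions for illustration.

  import glob
  import pandas as pd

  def merge_backups(csv_path: str, numeric_booleans: bool = True) -> None:
      """Merge every pickled batch into one CSV matching the client's specifications."""
      frames = [pd.read_pickle(path) for path in glob.glob("./backup/*.pkl")]
      df = pd.concat(frames, ignore_index=True)
      df = df.drop_duplicates()   # constraint: no duplicates
      df = df.dropna()            # constraint: no empty fields
      if numeric_booleans:
          # Deliver 1/0 instead of True/False when the client wants numeric values
          for column in ("furnished", "open_fire", "terrace", "garden"):
              df[column] = df[column].astype(int)
      df.to_csv(csv_path, index=False)

  merge_backups("dataset.csv")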

Task distribution:

  • Joffrey: General design of the OOP structure, definition of the classes, collecting URL and website references to include in the algorithm, running the program to collect the data.
  • Noah: Collecting URL and website references to include in the algorithm, implementation of the UrlGrabber (e.g. using BeautifulSoup, regex, etc.), running the program to collect the data.
  • Jean-Christophe: Collecting URL and website references to include in the algorithm, programming the code that collects the requested categories (e.g. using BeautifulSoup, regex, etc.), running the program to collect the data, assisting with the pandas-related code.