Scrapy crawler developed by Samuel Abolo for the Quibble Python developer Challenge
Because Airbnb is a dynamic site that loads most of its content with JavaScript, there were two ways to approach this.
The first was to use a framework like Scrapy Splash or Selenium to fully load the site and execute the JavaScript code. The disadvantage is that these frameworks are heavy to set up and make the system more complex.
The second approach, which I picked, was to find the endpoint the JavaScript sends its requests to and scrape from it directly.
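A minimal sketch of the second approach, using only the standard library. The endpoint URL, the query parameter names, and the JSON field names below are all hypothetical placeholders; the real ones have to be read from the browser's network tab and are not documented here.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only. The real URL is whatever the
# site's JavaScript calls, as seen in the browser's network tab.
API_URL = "https://example.com/api/search"


def build_search_url(location, offset=0, page_size=20):
    """Return the URL for one page of results from the JSON endpoint."""
    # Parameter names are assumptions, not the site's actual ones.
    query = {"query": location, "items_offset": offset, "items_per_page": page_size}
    return f"{API_URL}?{urlencode(query)}"


def parse_listings(body):
    """Extract a few fields from a JSON response body (assumed layout)."""
    data = json.loads(body)
    return [
        {
            "id": item.get("id"),
            "name": item.get("name"),
            # An empty or missing host name is normalised to None.
            "host": item.get("host_name") or None,
        }
        for item in data.get("listings", [])
    ]
```

In a Scrapy spider, the URL would go into a `scrapy.Request` and the parsing would happen in the callback on `response.text`.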
- I noticed that the name of a listing's owner was sometimes left blank.
- I noticed that more detailed rating information, such as the ratings for accuracy, check-in, and location, has to be scraped from the room link itself. This can be done with Splash.
- I needed a way to handle pagination. This was done using the pagination info included in the response.
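The pagination note can be sketched roughly as follows. The key names (`has_next_page`, `items_per_page`, `pagination`, `listings`) are assumptions made for this example and may not match the real response; `fetch_page` is a hypothetical callable that returns one parsed page for a given offset.

```python
def next_offset(pagination, current_offset):
    """Return the offset of the next page, or None when there are no more.

    `pagination` is the pagination section of a parsed response. The key
    names used here are assumptions, not the endpoint's real ones.
    """
    if not pagination.get("has_next_page"):
        return None
    return current_offset + pagination.get("items_per_page", 20)


def crawl_all(fetch_page):
    """Yield every listing, following pagination until it runs out."""
    offset = 0
    while offset is not None:
        page = fetch_page(offset)  # hypothetical: returns a parsed page dict
        yield from page["listings"]
        offset = next_offset(page["pagination"], offset)
```

In a Scrapy spider the same check would decide whether the `parse` callback yields another `scrapy.Request` for the next offset.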
- Clone this repo
- Create a virtual environment with
python -m venv env
or
python -m virtualenv env
- Activate your virtual environment with
source ./env/bin/activate
for Linux, or
.\env\Scripts\activate
for Windows
- Run
pip install -r requirements.txt
- Create a
.env
file and add
MONGO_URI=<your-mongodb-connection-string>
- cd into the folder airbnb
- Run
scrapy crawl listings
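For reference, here is one way the `MONGO_URI` value from the `.env` step could be read. This is a standard-library sketch, not necessarily how this project loads it; the helper names `read_env_file` and `get_mongo_uri` are made up for the example.

```python
import os


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first "=" only, since a URI may contain more.
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def get_mongo_uri(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    uri = os.environ.get("MONGO_URI")
    if not uri and os.path.exists(path):
        uri = read_env_file(path).get("MONGO_URI")
    if not uri:
        raise RuntimeError("MONGO_URI is not set; add it to the .env file")
    return uri
```

Failing early with a clear error when the variable is missing is usually easier to debug than a connection error partway through the crawl.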