App stores crawler

Crawler for the Apple and Google app stores that collects the rankings of the top apps by category.

Features

Concurrent: it uses the asyncio library (Python 3.5+) to fetch results faster
Distributed: you can run the crawler from as many machines as you want, all sharing a centralized db
Priority to download: the stores hold a practically unlimited number of apps, so the top-ranked apps are refreshed more frequently. Updates can start before every app has been downloaded.
Configurable random wait time between requests, to avoid flooding the app stores
Fault tolerant: what happens if the internet connection stops working for a minute? The crawler handles the error and puts the URL back in the queue to be indexed again
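
The concurrency, random wait and requeue-on-failure features can be sketched roughly as follows. This is only an illustration, not the project's actual code: the names worker, crawl, demo and flaky_fetch, the wait range and the retry limit are all assumptions, and a fake fetch function stands in for real HTTP requests.

```python
import asyncio
import random


async def worker(queue, fetch, results, min_wait=0.0, max_wait=0.01, max_attempts=3):
    """Process (url, attempt) items from the queue until cancelled."""
    while True:
        url, attempt = await queue.get()
        try:
            # Random pause so requests are not sent in a tight burst.
            await asyncio.sleep(random.uniform(min_wait, max_wait))
            results[url] = await fetch(url)
        except OSError:
            # Network-style failure: put the URL back to be tried again,
            # up to max_attempts tries in total (an assumed limit).
            if attempt + 1 < max_attempts:
                queue.put_nowait((url, attempt + 1))
        finally:
            queue.task_done()


async def crawl(urls, fetch, concurrency=3):
    """Fetch all urls with a small pool of concurrent workers."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait((url, 0))
    results = {}
    workers = [
        asyncio.ensure_future(worker(queue, fetch, results))
        for _ in range(concurrency)
    ]
    await queue.join()  # returns once every queued item has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


def demo():
    failed_once = set()

    async def flaky_fetch(url):
        # Hypothetical stand-in for an HTTP request: the first call for
        # each URL fails, the retry succeeds.
        if url not in failed_once:
            failed_once.add(url)
            raise ConnectionError("simulated outage")
        return "page for " + url

    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(crawl(["a", "b", "c"], flaky_fetch))
    finally:
        loop.close()
```

Calling demo() fetches three placeholder URLs; each one fails once, is put back in the queue, and succeeds on the second attempt.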

How does it work?

Microservices

There are 2 microservices:

urldb: manages the queue of URLs to be downloaded and adds new URLs to it
appsdb: handles app creation and queries

Crawler

Crawls the categories of the app stores, then asks urldb which URLs are next in the queue to be crawled. It also sends the URLs of newly found apps to urldb to be queued.
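
To make the exchange between the crawler and urldb more concrete, here is a minimal in-memory sketch of a prioritized URL queue. It assumes that a lower rank number means a better-ranked app that should be served first, and that a URL already waiting in the queue is not added twice. The class UrlQueue and its methods are hypothetical; the real urldb service is backed by a centralized db and may work differently.

```python
import heapq
import itertools


class UrlQueue:
    """In-memory stand-in for the queue that urldb manages (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._queued = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, url, rank):
        """Queue a URL; return False if it is already waiting in the queue."""
        if url in self._queued:
            return False
        self._queued.add(url)
        heapq.heappush(self._heap, (rank, next(self._counter), url))
        return True

    def next(self):
        """Return the best-ranked queued URL, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        self._queued.discard(url)
        return url


# Usage: the crawler reports newly found app URLs, then asks what to crawl next.
queue = UrlQueue()
queue.add("https://example.com/app/ranked-50", 50)
queue.add("https://example.com/app/ranked-1", 1)
first = queue.next()  # the rank 1 URL comes out before the rank 50 URL
```

Because a URL leaves the set of queued URLs once it is handed out, it can be added again later, which fits the idea of refreshing top-ranked apps repeatedly.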

How to run it?

First, create a virtualenv for Python 3.6 and install the requirements:

virtualenv --python=python3.6 env
source env/bin/activate
pip install -r requirements.txt

Run appdb.py

source env/bin/activate
python appdb.py

Run urldb.py

source env/bin/activate
python urldb.py

Run as many instances of crawler.py as you want, but be careful not to flood the app stores and get blocked.

source env/bin/activate
python crawler.py

The UI

There is a simple UI, built with Material Design, for querying the results. To serve it, run a simple Python HTTP server inside the ui directory:

python -m http.server

Open http://localhost:8000/ in your browser to see it.

Tests

To run the tests, the project root directory must be in PYTHONPATH:

PYTHONPATH=$PWD py.test

gabrielverta/appstores_crawler