Varshul/async-crawler

A recursive async crawler which creates a graph of connected webpages

Python

Async Recursive Crawler

This is a simple crawler that crawls webpages according to the regex provided, starting from the given url, and crawls till the max depth given. It uses the new async/await coroutines introduced in PEP 492.

Todo

create a network visualization with the data saved
convert mongodb operations to bulk update

Stats

These tests were run on a free tier AWS EC2 server with this starting url.
Current results :

Time Taken for 494 requests(recursion level 1) : 5.484668092802167 sec
Time Taken for 36997 requests(recursion level 2) : 415.45510824956 sec

Dependencies