crawly crawls the web starting from a set of seed URLs. It sends a request to each URL, parses the URLs out of the response it receives, stores them in a repository, and prints them to STDOUT as they are fetched. If a number of URLs to fetch is specified, it stops crawling once that many URLs have been successfully fetched; a sketch of this fetch-parse-store loop follows the requirements below.
- Python 2.7.10
- Python Module - Beautiful Soup 4.6.0
- Python Module - Requests 2.14.2
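The core loop can be pictured roughly as follows. This is only an illustrative sketch built on Requests and Beautiful Soup; the queue, the in-memory `fetched` set standing in for the repository, and the function names are assumptions, not crawly's actual implementation.

```python
# Illustrative sketch of the fetch-parse-store loop -- names and structure are
# assumptions, not taken from crawly's source.
from collections import deque

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, count=None):
    queue = deque(seed_urls)   # URLs waiting to be fetched
    fetched = set()            # stands in for the URL repository

    while queue and (count is None or len(fetched) < count):
        url = queue.popleft()
        if url in fetched:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue           # skip URLs that fail to fetch

        fetched.add(url)
        print(url)             # report each successfully fetched URL

        # Parse anchors out of the response and enqueue absolute links.
        soup = BeautifulSoup(response.text, 'html.parser')
        for anchor in soup.find_all('a', href=True):
            href = anchor['href']
            if href.startswith('http'):
                queue.append(href)
```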
```
$ python main.py [-h] [-s SEED_URLS [SEED_URLS ...]] [-c COUNT]

optional arguments:
  -h, --help            show this help message and exit
  -s SEED_URLS [SEED_URLS ...], --seed_urls SEED_URLS [SEED_URLS ...]
                        Set of seed urls
  -c COUNT, --count COUNT
                        Number of links to be fetched
```
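These options map onto a small argparse configuration. The snippet below is an assumed reconstruction of how main.py might define them, shown for clarity; it is not the project's actual source.

```python
# Assumed reconstruction of the CLI definition -- not taken from crawly's source.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='crawly - a simple web crawler')
    parser.add_argument('-s', '--seed_urls', nargs='+',
                        default=['https://www.python.org'],
                        help='Set of seed urls')
    parser.add_argument('-c', '--count', type=int, default=None,
                        help='Number of links to be fetched')
    return parser.parse_args()
```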
The following command starts crawling from the seed URLs https://www.python.org and https://docs.python.org and stops once 10 URLs have been successfully fetched. If no seed URL is specified, https://www.python.org is used as the default seed URL. If no count is specified, it keeps crawling until it receives a keyboard interrupt.

```
$ python main.py --seed_urls 'https://www.python.org' 'https://docs.python.org' --count 10
```
All logs (debug, error, info) generated during the execution of the program are stored in `logs/crawly.log`.
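A logging setup along the following lines would produce that file. The handler, level, and format below are assumptions made for illustration, not necessarily what crawly configures.

```python
# Illustrative logging setup -- the format and logger name are assumptions.
import logging


def configure_logging(log_path='logs/crawly.log'):
    # Assumes the logs/ directory already exists.
    logging.basicConfig(
        filename=log_path,
        level=logging.DEBUG,  # capture debug, info and error records
        format='%(asctime)s %(levelname)s %(name)s: %(message)s',
    )
    return logging.getLogger('crawly')
```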
- `make clean`: Clears all the `.pyc` and `.log` files generated during execution of the program.
- `make clean-logs`: Clears only the `.log` files generated during execution of the program.
- `make clean-pyc`: Clears only the `.pyc` files generated during execution of the program.
- `make run`: Executes the program taking `https://www.python.org` as the default seed URL and crawls until it receives a keyboard interrupt.
- Multithreaded or distributed crawler that issues many HTTP requests in parallel
- Obey `robots.txt` before crawling a website (a possible check is sketched below)
- Skip fetching image, video and document URLs
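For the `robots.txt` item, Python's standard library already ships a parser. The helper below is only a sketch of how such a check might look; the function name and usage are assumptions and are not part of crawly.

```python
# Sketch of a robots.txt check -- the helper name is an assumption, not part of crawly.
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser         # Python 2

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def is_allowed(url, user_agent='crawly'):
    """Return True if the host's robots.txt permits fetching the URL."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url('%s://%s/robots.txt' % (parts.scheme, parts.netloc))
    parser.read()  # fetches and parses the remote robots.txt
    return parser.can_fetch(user_agent, url)
```

A crawler would call a check like this before each request and skip disallowed URLs; in practice the parsed `robots.txt` should be cached per host so it is not re-fetched for every link.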