Simple crawler project for information retrieval. This script crawls the web starting from a given URL.
- Install docker.
- Install git.
- Clone this project.
git clone https://github.com/fahernandez/simple-crawler
- Execute
cd simple-crawler
docker run -ti -v $PWD/src:/src fahernandez/simple-crawler:latest --levels=20 --gigabytes=2 --restart=true
Usage: crawler.py [OPTIONS]
Options:
--gigabytes INTEGER Max number of gigabytes to be downloaded.
--url TEXT Page url to be crawled.
--levels INTEGER Maximum depth level to be reached while crawling.
--restart BOOLEAN Restart the crawling process.
--help Show this message and exit.
Note: The crawl results are saved to the file url.txt.
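To illustrate how the --levels and --gigabytes options might bound a crawl, here is a minimal sketch of a breadth-first crawler with a depth limit and a download budget. It is not the code in crawler.py: the names crawl, LinkParser, fetch, max_levels and max_bytes are hypothetical, and the fetch function is passed in by the caller so the sketch can run without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_levels, max_bytes):
    """Breadth-first crawl; returns the visited URLs in visit order.

    fetch(url) -> bytes is a hypothetical callable supplied by the caller.
    max_levels is the deepest link level to follow (0 = start page only).
    max_bytes is the download budget; the crawl stops once it is reached.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    downloaded = 0

    while queue and downloaded < max_bytes:
        url, level = queue.popleft()
        body = fetch(url)
        downloaded += len(body)
        visited.append(url)

        # Pages at the maximum level are recorded but not expanded.
        if level >= max_levels:
            continue

        parser = LinkParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))

    return visited
```

In this sketch a gigabyte budget would be passed as max_bytes = gigabytes * 1024**3 (an assumption about the unit), and the returned list could be written one URL per line to a file such as url.txt. A real fetch function would also need timeouts and error handling, which are left out here.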