fahernandez/simple-crawler

Information Retrieval simple crawler project

Language: Python. License: MIT.

Simple crawler

Information retrieval simple crawler project. This script crawls the web starting from a given URL.

Requirements

Install Docker.

Install Git.

How to run the project

Clone this project.

git clone https://github.com/fahernandez/simple-crawler

Execute

cd simple-crawler
docker run -ti -v $PWD/src:/src fahernandez/simple-crawler:latest --levels=20 --gigabytes=2 --restart=true

Options

Usage: crawler.py [OPTIONS]

Options:
  --gigabytes INTEGER  Max number of gigabytes to be downloaded.
  --url TEXT           Page url to be crawled.
  --levels INTEGER     Maximum depth level to reach while crawling.
  --restart BOOLEAN    Restart the crawling process.
  --help               Show this message and exit.

Note: The crawl results are saved to the file url.txt.
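To give a rough idea of how the --levels and --gigabytes limits could interact, here is a minimal sketch of a breadth-first crawl bounded by link depth and by total bytes downloaded. It is not taken from crawler.py: the names `crawl`, `LinkParser`, `fetch`, and `PAGES` are hypothetical, the byte budget is expressed in bytes rather than gigabytes, and pages come from an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_levels, max_bytes):
    """Breadth-first crawl bounded by link depth and total bytes downloaded.

    `fetch` is a hypothetical callable mapping a URL to its HTML text
    (or None on failure); it is an assumption made for this sketch.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    downloaded = 0
    while queue:
        url, level = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        size = len(html.encode("utf-8"))
        if downloaded + size > max_bytes:
            break  # byte budget exhausted, stop crawling
        downloaded += size
        visited.append(url)
        if level >= max_levels:
            continue  # do not follow links past the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited


# In-memory stand-in for the web, used only for illustration.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/c">C</a>',
    "http://example.test/b": "no links here",
    "http://example.test/c": '<a href="/">home</a>',
}

result = crawl("http://example.test/", PAGES.get, max_levels=1, max_bytes=10**9)
print(result)
```

With `max_levels=1` the sketch visits the start page and the pages it links to directly, but does not follow links any deeper. A real crawler would replace `PAGES.get` with an HTTP client and would likely also persist its progress, which is presumably what --restart controls.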