Primary language: Python. License: MIT.

Simple crawler

Information retrieval simple crawler project. This script crawls the web starting from a given URL.

Requirements

  1. Install Docker.
  2. Install Git.

How to run the project

  1. Clone this project.
git clone https://github.com/fahernandez/simple-crawler
  2. Change into the project directory and run the crawler:
cd simple-crawler
docker run -ti -v $PWD/src:/src fahernandez/simple-crawler:latest --levels=20 --gigabytes=2 --restart=true

Options

Usage: crawler.py [OPTIONS]

Options:
  --gigabytes INTEGER  Maximum number of gigabytes to download.
  --url TEXT           URL of the page to crawl.
  --levels INTEGER     Maximum depth level to reach while crawling.
  --restart BOOLEAN    Restart the crawling process.
  --help               Show this message and exit.
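
The way these options turn into typed values can be illustrated with a small parser. This is only a sketch: it uses the standard-library argparse module, which is an assumption and not necessarily what crawler.py uses, and the parse_bool helper and the None defaults are hypothetical, since this README documents no default values.

```python
import argparse


def parse_bool(text: str) -> bool:
    """Interpret common true/false spellings, e.g. --restart=true."""
    value = text.strip().lower()
    if value in {"1", "true", "t", "yes", "y", "on"}:
        return True
    if value in {"0", "false", "f", "no", "n", "off"}:
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {text!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the four documented options (hypothetical re-creation)."""
    parser = argparse.ArgumentParser(prog="crawler.py")
    # No defaults are documented, so None simply marks "option not given".
    parser.add_argument("--gigabytes", type=int, default=None,
                        help="Maximum number of gigabytes to download.")
    parser.add_argument("--url", type=str, default=None,
                        help="URL of the page to crawl.")
    parser.add_argument("--levels", type=int, default=None,
                        help="Maximum depth level to reach while crawling.")
    parser.add_argument("--restart", type=parse_bool, default=None,
                        help="Restart the crawling process.")
    return parser


# Same option values as the docker command in "How to run the project".
args = build_parser().parse_args(
    ["--levels=20", "--gigabytes=2", "--restart=true"]
)
print(args.levels, args.gigabytes, args.restart, args.url)
# prints: 20 2 True None
```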

Note: The crawl results are saved to the file url.txt.
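
How the --levels and --gigabytes limits could bound a crawl is sketched below. This is not the project's implementation: it assumes a breadth-first traversal, and the crawl function, the fetch callback, and the in-memory FAKE_WEB pages are all hypothetical stand-ins so that the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

GIGABYTE = 1024 ** 3


def crawl(start_url: str,
          fetch: Callable[[str], Tuple[bytes, Iterable[str]]],
          max_levels: int,
          max_bytes: int) -> List[str]:
    """Breadth-first crawl bounded by link depth and total bytes fetched.

    `fetch` is a hypothetical callback returning (page body, outgoing links).
    Returns the URLs fetched, in the order they were visited.
    """
    fetched: List[str] = []
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth level)
    downloaded = 0
    while queue and downloaded < max_bytes:
        url, level = queue.popleft()
        body, links = fetch(url)
        downloaded += len(body)
        fetched.append(url)
        if level < max_levels:  # only follow links above the depth limit
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    return fetched


# Tiny in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "a": (b"x" * 10, ["b", "c"]),
    "b": (b"x" * 10, ["d"]),
    "c": (b"x" * 10, ["a"]),
    "d": (b"x" * 10, []),
}


def fake_fetch(url: str) -> Tuple[bytes, Iterable[str]]:
    return FAKE_WEB[url]


pages = crawl("a", fake_fetch, max_levels=1, max_bytes=2 * GIGABYTE)
print(pages)  # prints: ['a', 'b', 'c']
```

With max_levels=1 the page "d" is never reached because it sits two links away from the start page. A smaller max_bytes stops the loop earlier, regardless of depth.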