Everybody must implement at least one crawler

Hyper-links Crawler

Traverses the Web as a linked graph from the starting --url, finding
all outgoing links (<a> tags).
It stores each outgoing link for the URL, and then repeats the process
for each of them, until --limit URLs have been traversed.
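
In outline, this is a breadth-first traversal of the link graph. The sketch
below illustrates the idea only; it is not the project's actual implementation
and assumes the requests and beautifulsoup4 packages are available.

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup


    def crawl(start_url, limit):
        """Return a mapping of each visited URL to its outgoing links."""
        graph = {}
        queue = deque([start_url])
        while queue and len(graph) < limit:
            url = queue.popleft()
            if url in graph:
                continue  # already traversed
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # skip unreachable URLs
            # Collect every <a href="..."> on the page, resolved to absolute URLs
            links = [
                urljoin(url, a["href"])
                for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)
            ]
            graph[url] = links
            queue.extend(links)
        return graph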

USAGE:
    hyperlinks [options] --url <start-url> --limit <limit>
    hyperlinks -h | --help
    hyperlinks --version

OPTIONS:
    -h --help             Show this screen
    --version             Show version
    --url <start-url>     URL from which to start hyper-links crawling
    --limit <limit>       Limit of URLs to traverse
    --out <dest-file>     Path to the JSON file where the output is stored;
                          if not specified, JSON is written to STDOUT
    --pretty-print        JSON output will be pretty-printed
    --dbout               Causes the data to be stored in a MongoDB collection
    --concurrent          Run crawler using async HTTP requests (experimental)
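
For example, a typical invocation (using https://example.com and a limit of 100
purely as illustrative values) might look like:

    hyperlinks --url https://example.com --limit 100 --out links.json --pretty-print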

Initialization

To install the dependencies in a virtualenv and run the tests, use

start_from_scratch.sh

Running tests

make test