Simple web crawler

A simple crawler that starts from a URL, determines the 10 most important words on that page (according to a simplistic heuristic), and collects the hyperlinks it finds. It then repeats this process with those URLs, one level deeper each time.
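The crawler's source is not shown here; the following is only a minimal sketch of the idea in Python. The use of requests, BeautifulSoup, and a plain word-frequency heuristic are assumptions for illustration, not necessarily what the project actually does.

```python
import re
import time
from collections import Counter, deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def top_words(text, n=10):
    # Simplistic heuristic (assumed): rank words by frequency, ignoring short ones.
    words = re.findall(r"[a-zA-Z]{4,}", text.lower())
    return [w for w, _ in Counter(words).most_common(n)]


def crawl(start_url, max_level, max_seconds):
    deadline = time.time() + max_seconds
    queue = deque([(start_url, 0)])  # (url, level)
    visited = set()

    while queue and time.time() < deadline:
        url, level = queue.popleft()
        if url in visited or level > max_level:
            continue
        visited.add(url)

        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue

        soup = BeautifulSoup(page.text, "html.parser")
        words = top_words(soup.get_text())

        # Queue the hyperlinks found on this page for the next level.
        children = 0
        for link in soup.find_all("a", href=True):
            queue.append((urljoin(url, link["href"]), level + 1))
            children += 1

        yield url, level, children, words
```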

Command line args

  1. URL of the page to start from (including the protocol)
  2. Maximum number of levels (how deep the crawler can go)
  3. Maximum execution time in seconds
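
The arguments are positional. A sketch of how they might be read, with a hypothetical entry-point name in the comment:

```python
import sys

# Example invocation (crawler.py is a hypothetical name):
#   python crawler.py https://example.com 2 60
start_url = sys.argv[1]           # URL of the start page, including protocol
max_level = int(sys.argv[2])      # maximum crawl depth
max_seconds = float(sys.argv[3])  # maximum execution time in seconds
```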

Results

The program writes the results into two files:

└───logs
        log.txt
        log.csv

Both files contain the same information. Each line represents a visited URL and lists:

  • the URL
  • the level of that URL
  • how many children URLs were put in the queue while scraping that URL
  • the 10 most important words on that page
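
A rough sketch of how each result line could be written to both files, reusing the crawl() generator and command-line values from the sketches above (the exact column names and line format are assumptions):

```python
import csv
import os

os.makedirs("logs", exist_ok=True)

with open("logs/log.txt", "w", encoding="utf-8") as txt_file, \
     open("logs/log.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["url", "level", "children_queued", "top_words"])

    for url, level, children, words in crawl(start_url, max_level, max_seconds):
        # Same information in both files: plain text and CSV.
        txt_file.write(f"{url} | level {level} | {children} children queued | {' '.join(words)}\n")
        writer.writerow([url, level, children, " ".join(words)])
```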