A simple crawler that starts from a URL, determines the 10 most important words on that page (according to a simplistic heuristic), and crawls the hyperlinks it finds. It then repeats this process with the discovered URLs.
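The heuristic itself is not specified here; as a rough illustration only, a minimal sketch of one plausible approach (plain frequency counting after stripping stopwords; the stopword list, tokenizer, and function name are assumptions, not this project's actual method):

```python
import re
from collections import Counter

# Assumed stopword list for illustration; the project's real heuristic may differ.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "by", "at", "be"}

def top_words(text: str, n: int = 10) -> list[str]:
    """Return the n most frequent non-stopword words in the page text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]
```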
The crawler takes three inputs (their interplay is sketched below):
- the URL of the page to start from (including the protocol)
- the maximum number of levels (how deep the crawler may go)
- the maximum execution time in seconds
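A minimal sketch of how these limits might drive a breadth-first crawl, assuming a simple queue of (URL, level) pairs; the function name, link-extraction regex, and per-request timeout are illustrative assumptions, not the project's actual API:

```python
import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(start_url: str, max_level: int, max_seconds: float) -> None:
    deadline = time.monotonic() + max_seconds   # enforce the execution-time limit
    queue = deque([(start_url, 0)])             # (url, level) pairs, breadth-first
    seen = {start_url}
    while queue and time.monotonic() < deadline:
        url, level = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue                            # unreachable page: skip it
        children = 0
        if level < max_level:                   # respect the depth limit
            for href in re.findall(r'href="([^"]+)"', html):
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
                    children += 1
        # here: extract the top-10 words and log url, level, and children
```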
The program writes the results into two files:
└───logs
        log.txt
        log.csv
Both files contain the same information. Each line represents a visited URL and records:
- the URL
- the level of that URL
- how many child URLs were put in the queue while scraping that URL
- the 10 most important words on that page
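A minimal sketch of how one such record could be appended to both files; the column order, separators, and function name are assumptions based on the description above, not the project's actual output format:

```python
import csv

def log_result(url: str, level: int, children: int, words: list[str]) -> None:
    """Append one visited-URL record to both log files (assumes logs/ exists)."""
    line = f"{url} level={level} children={children} words={','.join(words)}"
    with open("logs/log.txt", "a", encoding="utf-8") as txt:
        txt.write(line + "\n")
    with open("logs/log.csv", "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([url, level, children, " ".join(words)])
```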