Open source WebCrawler

A simple web crawler built on OpenJDK 11.

Overview

Features/Supports

Deep search
Max visited pages
Sync/Async worker
Term search, with results sorted by total hits
Exports statistics to a single CSV file or to separate CSV files
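
To illustrate how the depth and max visited pages limits interact, here is a minimal sketch of a bounded breadth-first crawl. It is not the project's actual code: the class name CrawlSketch, the crawl method, and the in-memory link map (used in place of real HTTP fetching) are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the real crawler fetches pages over HTTP;
// here a map of "page -> outgoing links" stands in for the web.
class CrawlSketch {

    /** Breadth-first visit from the seed, bounded by maxDepth and maxPages. */
    static List<String> crawl(Map<String, List<String>> links,
                              String seed, int maxDepth, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> current = new ArrayDeque<>();
        current.add(seed);
        seen.add(seed);

        // Process one depth level at a time; the seed is depth 0.
        for (int depth = 0; depth <= maxDepth && !current.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String url : current) {
                if (visited.size() >= maxPages) {
                    return visited; // page budget exhausted
                }
                visited.add(url);
                for (String out : links.getOrDefault(url, List.of())) {
                    if (seen.add(out)) { // queue each page at most once
                        next.add(out);
                    }
                }
            }
            current = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("a", "d"),
                "d", List.of("e"));
        System.out.println(crawl(links, "a", 1, 10)); // prints [a, b, c]
        System.out.println(crawl(links, "a", 3, 2));  // prints [a, b]
    }
}
```

In the first call the depth limit stops the crawl before page d is visited; in the second, the page limit stops it after two pages even though the depth limit would allow more.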

Installation

Download the latest release files.
Check the crawl settings in crawlsearcher.txt.
Run WebCrawler.jar.
When the crawler finishes, the reports are written to CSV files if you included the -csv flag.

crawlsearcher.txt allows you to change:

- root seed
- depth
- max visited pages
- sync/async worker strategy
- st and et params: the start terms and end terms; setting them runs the term searcher
- csv: exports statistics to CSV files
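
As a rough picture of what the term searcher and CSV export produce, the sketch below counts term hits per page and emits CSV rows sorted by total hits. It is only an illustration under assumptions: the class TermStatsSketch, its methods, the column layout, and the sample pages are hypothetical and are not taken from the project's source or from the crawlsearcher.txt format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of term counting and CSV output; not the project's code.
class TermStatsSketch {

    /** Counts non-overlapping, case-insensitive occurrences of term in text. */
    static int countHits(String text, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase();
        String needle = term.toLowerCase();
        int hits = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            hits++;
            from += needle.length();
        }
        return hits;
    }

    /**
     * Builds CSV lines "page,<hits per term>,total", sorted by total hits
     * in descending order. Values are not escaped, so this assumes page
     * names and terms contain no commas or quotes.
     */
    static List<String> toCsv(Map<String, String> pages, List<String> terms) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            StringBuilder line = new StringBuilder(page.getKey());
            int total = 0;
            for (String term : terms) {
                int hits = countHits(page.getValue(), term);
                line.append(',').append(hits);
                total += hits;
            }
            line.append(',').append(total);
            rows.add(Map.entry(line.toString(), total));
        }
        // Highest total first; List.sort is stable, so ties keep input order.
        rows.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> csv = new ArrayList<>();
        csv.add("page," + String.join(",", terms) + ",total");
        for (Map.Entry<String, Integer> row : rows) {
            csv.add(row.getKey());
        }
        return csv;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("page1", "Java crawler in Java");
        pages.put("page2", "crawler crawler java crawler");
        toCsv(pages, List.of("java", "crawler")).forEach(System.out::println);
        // prints:
        // page,java,crawler,total
        // page2,1,3,4
        // page1,2,1,3
    }
}
```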

Development

Want to contribute? Great!

Fork the project & clone locally.
Create an upstream remote and sync your local copy before you branch.
Branch for each separate piece of work.
Do the work, write good commit messages.
Push to your origin repository.
Create a new PR in GitHub.
Respond to code review feedback.

If you want to contribute to an open source project, the best one to pick is one that you are using yourself. The maintainers will appreciate it!

Ivanovskij/WebCrawler
