Open source WebCrawler

A simple web crawler built on Java 11 (OpenJDK).

Overview

Features/Supports

  • Deep search
  • Max visited pages
  • Sync/Async worker
  • Term search with results sorted by total hits
  • Exports statistics to a single CSV file or to separate CSV files
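To illustrate how the deep search and max visited pages limits can interact, here is a minimal sketch of a breadth-first crawl bounded by both. It is not the project's actual code: the class name `CrawlSketch`, the `crawl` method, and the `linksOf` function are hypothetical, and an in-memory link graph stands in for real HTTP fetching and HTML parsing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch, not the WebCrawler implementation.
public class CrawlSketch {

    /**
     * Visits pages breadth-first starting at seed. Stops descending once
     * maxDepth levels below the seed are reached, and stops entirely once
     * maxVisited pages have been visited.
     */
    public static List<String> crawl(String seed, int maxDepth, int maxVisited,
                                     Function<String, List<String>> linksOf) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        seen.add(seed);
        List<String> level = List.of(seed);

        for (int depth = 0; depth <= maxDepth && !level.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String url : level) {
                if (visited.size() >= maxVisited) {
                    return visited; // page budget exhausted
                }
                visited.add(url);
                // Queue each outgoing link only the first time it is seen.
                for (String link : linksOf.apply(url)) {
                    if (seen.add(link)) {
                        next.add(link);
                    }
                }
            }
            level = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        // Assumed toy link graph; a real crawler would fetch and parse pages.
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("d"),
                "d", List.of("e"));
        Function<String, List<String>> linksOf =
                url -> graph.getOrDefault(url, List.of());

        System.out.println(crawl("a", 1, 10, linksOf)); // [a, b, c]
        System.out.println(crawl("a", 5, 2, linksOf));  // [a, b]
    }
}
```

Passing the link lookup as a function keeps the traversal logic separate from networking, which is one way the same loop could be driven by either a synchronous or an asynchronous worker.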

Installation

  1. Download the latest release files.
  2. Check crawl settings in crawlsearcher.txt.
  3. Run WebCrawler.jar
  4. When the crawler finishes, the reports are written to CSV files, provided you included the -csv flag.

crawlsearcher.txt allows you to change:

- root seed
- depth
- max visited pages
- sync/async worker strategy
- st / et params: start terms and end terms; setting them runs the term searcher
- csv: exports statistics to CSV files
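As a rough illustration of what the term searcher and CSV export could produce, the sketch below counts case-insensitive hits per term for each page, sorts pages by total hits, and formats one CSV row per page. The class `TermHitsSketch`, its methods, and the row layout (`url,<hits per term>,total`) are assumptions for illustration; the actual crawlsearcher.txt syntax and report format are not shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the WebCrawler implementation.
public class TermHitsSketch {

    /** Counts non-overlapping, case-insensitive occurrences of term in text. */
    static int countHits(String text, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length();
        }
        return count;
    }

    /**
     * Builds rows of the assumed form "url,hits(term1),...,total", sorted by
     * total hits in descending order. No CSV escaping is done, so URLs
     * containing commas would need quoting in real use.
     */
    public static List<String> toCsvRows(Map<String, String> pages, List<String> terms) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        Map<String, String> rows = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            StringBuilder row = new StringBuilder(page.getKey());
            int total = 0;
            for (String term : terms) {
                int hits = countHits(page.getValue(), term);
                total += hits;
                row.append(',').append(hits);
            }
            row.append(',').append(total);
            totals.put(page.getKey(), total);
            rows.put(page.getKey(), row.toString());
        }
        List<String> urls = new ArrayList<>(pages.keySet());
        // Stable sort: pages with equal totals keep their input order.
        urls.sort(Comparator.comparing((String u) -> totals.get(u)).reversed());
        List<String> out = new ArrayList<>();
        for (String u : urls) {
            out.add(rows.get(u));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample data standing in for crawled page text.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("https://example.com/a", "Java and java");
        pages.put("https://example.com/b", "Java crawler crawler Crawler");

        for (String row : toCsvRows(pages, List.of("java", "crawler"))) {
            System.out.println(row);
        }
        // https://example.com/b,1,3,4
        // https://example.com/a,2,0,2
    }
}
```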

Development

Want to contribute? Great!

  1. Fork the project & clone locally.
  2. Create an upstream remote and sync your local copy before you branch.
  3. Branch for each separate piece of work.
  4. Do the work, write good commit messages.
  5. Push to your origin repository.
  6. Create a new PR in GitHub.
  7. Respond to code review feedback.

If you want to contribute to an open source project, the best one to pick is one that you are using yourself. The maintainers will appreciate it!