Crawler4j is a Java library that provides a simple interface for crawling the web. It is fairly configurable, although the project has many open issues and stale pull requests.
The project was built with:
- Java 14
- Gradle 6.3
Provide input properties in app.properties.
Build project:
```
gradle clean build
```
Test project:
```
gradle clean test
```
Run project:
```
gradle run
```
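For orientation, below is a minimal sketch of how a crawler4j run is typically wired up. The class, the seed URL, and the storage path are illustrative assumptions, not this project's actual code; only the generic crawler4j API (`CrawlConfig`, `CrawlController`, `WebCrawler`) is taken from the library itself.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

// Hypothetical crawler class for illustration only.
public class SitemapCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Stay on the seed domain; skip everything else.
        return url.getURL().startsWith("https://example.com/");
    }

    @Override
    public void visit(Page page) {
        // Record each visited URL (a real sitemap builder would collect these).
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // intermediate crawl data
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://example.com/");
        controller.start(SitemapCrawler.class, /* number of crawler threads */ 4);
    }
}
```

Running this requires the crawler4j dependency on the classpath and network access, so it is a sketch of the wiring rather than a drop-in replacement for `gradle run`.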
The crawling result is sitemap.json, stored at the path crawledStorageFolder + resultsFilename. An example sitemap can be found in the project resources.
- usage of jsoup for advanced extraction,
- dedicated interfaces for static resources/locations.
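To illustrate the jsoup-based extraction mentioned above, here is a small, self-contained sketch. The HTML string, class name, and base URI are made up for the example; only the jsoup calls (`Jsoup.parse`, `select`, `attr`) are the library's real API, and the snippet needs the org.jsoup:jsoup dependency to compile.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper showing link extraction with jsoup.
public class LinkExtractor {
    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href='/about'>About</a>"
                + "<a href='https://example.com/page'>Page</a>"
                + "</body></html>";
        // Parse with a base URI so relative links can be resolved to absolute URLs.
        Document doc = Jsoup.parse(html, "https://example.com/");
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}
```

A crawler's `visit` method can apply the same `select`/`attr` pattern to `page` content to pull out whatever elements the sitemap needs.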