Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that use JavaScript.
Add the following dependency to your pom.xml:
```xml
<dependency>
    <groupId>com.github.peterbencze</groupId>
    <artifactId>serritor</artifactId>
    <version>1.6.0</version>
</dependency>
```
Add the following dependency to your build.gradle:
```groovy
compile group: 'com.github.peterbencze', name: 'serritor', version: '1.6.0'
```
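Note that the `compile` configuration was removed in Gradle 7; on newer Gradle versions, declare the dependency with `implementation` instead:

```groovy
implementation group: 'com.github.peterbencze', name: 'serritor', version: '1.6.0'
```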
The standalone JAR files are available on the releases page.
The BaseCrawler abstract class provides a skeletal implementation of a crawler to minimize the effort needed to create your own. The extending class should define the logic of the crawler.
Below you can find a simple example that is enough to get you started:
```java
public class MyCrawler extends BaseCrawler {

    private final UrlFinder urlFinder;

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);

        // Extract URLs from links on the crawled page
        urlFinder = UrlFinder.createDefault();
    }

    @Override
    protected void onPageLoad(final PageLoadEvent event) {
        // Crawl every URL that matches the given pattern
        urlFinder.findUrlsInPage(event)
                .stream()
                .map(CrawlRequest::createDefault)
                .forEach(this::crawl);

        // ...
    }
}
```
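Because `findUrlsInPage` simply returns the discovered URLs, you can apply ordinary stream operations before crawling them. For instance, a sketch of an `onPageLoad` that only follows links containing a hypothetical `/blog/` path segment:

```java
@Override
protected void onPageLoad(final PageLoadEvent event) {
    // Only follow links whose URL contains the (hypothetical) /blog/ path segment
    urlFinder.findUrlsInPage(event)
            .stream()
            .filter(url -> url.contains("/blog/"))
            .map(CrawlRequest::createDefault)
            .forEach(this::crawl);
}
```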
By default, the crawler uses the HtmlUnit headless browser:
```java
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFiltering(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start it
crawler.start();
```
Of course, you can also use any other browser by specifying a corresponding WebDriver instance:
```java
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFiltering(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start it
crawler.start(new ChromeDriver());
```
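Keep in mind that ChromeDriver requires the chromedriver executable to be available on your machine. A typical setup (the binary path below is illustrative) points Selenium at the executable and, optionally, runs Chrome headlessly via ChromeOptions:

```java
// Point Selenium at the chromedriver binary (path is illustrative)
System.setProperty("webdriver.chrome.driver", "/path/to/chromedriver");

// Optionally run Chrome without a visible window
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless");

crawler.start(new ChromeDriver(options));
```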
That's it! In just a few lines you can create a crawler that crawls every link it finds, while filtering duplicate and offsite requests. You also get access to the WebDriver instance, so you can use all the features provided by Selenium.
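For example, once you have a reference to the underlying driver inside a callback, any standard Selenium API is available. The `getWebDriver()` call in this sketch is an assumption; check the Javadoc of your Serritor version for the exact way to obtain the driver:

```java
// Sketch only: getWebDriver() is assumed here, not confirmed for your version
WebDriver driver = getWebDriver();

// From here on, this is plain Selenium
String pageTitle = driver.getTitle();
WebElement heading = driver.findElement(By.tagName("h1"));
System.out.println(pageTitle + ": " + heading.getText());
```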
If this framework helped you in any way, or you would like to support the development, any amount you choose to give will be greatly appreciated.
The source code of Serritor is made available under the Apache License, Version 2.0.