Smart web crawler.
A smart web crawler that fetches data from a website and stores it somewhere (e.g. writes it to files on disk, POSTs it to an HTTP endpoint etc).
More options for crawling:
- crawl the links from a sitemap.xml
- crawl the website as a graph, starting from a given URL (the index)
- crawl with retries if any RuntimeException occurs, etc.
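The retry option re-runs a crawl when a RuntimeException interrupts it. As a minimal self-contained sketch of that idea (this is not charles's actual implementation; the class and method names here are illustrative only):

```java
import java.util.function.Supplier;

/** Illustrative retry wrapper: re-runs an action when a RuntimeException occurs. */
public class RetryExample {

    /** Runs the action up to maxAttempts times (assumed >= 1), rethrowing the last failure. */
    static <T> T withRetries(Supplier<T> action, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated crawl that fails twice before succeeding.
        String result = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RuntimeException("transient failure");
            }
            return "crawled";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "crawled after 3 attempts"
    }
}
```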
More details in this post.
Get it using Maven:
<dependency>
  <groupId>com.amihaiemil.web</groupId>
  <artifactId>charles</artifactId>
  <version>1.1.1</version>
</dependency>
or take the fat jar.
Charles is powered by Selenium WebDriver. Any WebDriver implementation can be used to build a WebCrawl. Examples:
- PhantomJSDriver
- FirefoxDriver
- ChromeDriver etc.
Since it uses a web driver to render the pages, any dynamic content (e.g. content generated by JavaScript) will also be crawled.
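Any of those drivers can be instantiated directly and handed to the crawl. A sketch of the setup (the Selenium calls are real API; the commented WebCrawl construction is only an assumption about charles's API, not a confirmed signature):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

public class CrawlSketch {
    public static void main(String[] args) {
        // Any WebDriver implementation works: PhantomJSDriver, FirefoxDriver, ChromeDriver etc.
        WebDriver driver = new PhantomJSDriver();

        // Hypothetical: hand the driver to a crawl starting from the index page.
        // The exact class name and constructor are assumptions, not charles's documented API:
        // WebCrawl crawl = new GraphCrawl("http://www.example.com/", driver);
        // crawl.crawl();

        driver.quit(); // always release the browser process
    }
}
```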
Read this post.
- Open an issue regarding an improvement you thought of, or a bug you noticed.
- If the issue is confirmed, fork the repository, make the changes on a separate branch and open a Pull Request.
- After review and acceptance, the PR is merged and closed.
- You are automatically listed as a contributor on the project's site.
Make sure the Maven build

$ mvn clean install -Pitcases

passes before making a PR.
In order to run the integration tests, you need PhantomJS installed on your machine and the JVM system property phantomjsExec set to point to its location. By default the executable is looked up at /usr/local/bin/phantomjs (Linux), so if it's not found there the tests won't work.
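One way to set that property is on the Maven command line, assuming the build forwards -D properties to the test JVM (the path below is an example, adjust it to your installation):

```shell
# Run the integration tests, pointing phantomjsExec at a custom PhantomJS binary.
mvn clean install -Pitcases -DphantomjsExec=/usr/local/bin/phantomjs
```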