Simple java web crawler

Java web crawler

A simple Java (1.6) crawler that crawls web pages within a single domain. If a page redirects to another domain, that page is not picked up, except when it is the first URL tested. Basically, you can do this:

  • Crawl from a start point, define the depth of the crawl, and optionally restrict the crawl to a specific path
  • Output all working URLs
  • Output the data to a CSV file, separated into working (200 response code) and non-working URLs
  • Output the data to two text files, one with working URLs and one with non-working URLs, one URL per line
  • Output URLs whose HTML contains a keyword
  • Experimental support for verifying that the assets on a page work
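
The crawl behaviour listed above (start point, depth limit, single domain, optional path restriction) can be sketched as a breadth-first traversal. This is only an illustration, not the crawler's actual implementation: the class `CrawlSketch`, the in-memory `linksByPage` map that stands in for fetched pages, the substring match for the path restriction, and the reading of "level 1" as "links found on the start page" are all assumptions.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the crawler itself.
final class CrawlSketch {
    private CrawlSketch() {}

    /**
     * Walks the link graph level by level, starting at startUrl.
     * linksByPage maps a page URL to the links found on that page
     * (a stand-in for real HTTP fetching and HTML parsing).
     * followPath may be null, meaning "no path restriction".
     */
    static Set<String> crawl(String startUrl, int maxLevel, String followPath,
                             Map<String, List<String>> linksByPage) {
        String host = URI.create(startUrl).getHost();
        Set<String> seen = new LinkedHashSet<>();
        seen.add(startUrl);
        Deque<String> current = new ArrayDeque<>();
        current.add(startUrl);

        // Assumption: level 1 means the links found on the start page.
        for (int level = 1; level <= maxLevel && !current.isEmpty(); level++) {
            Deque<String> next = new ArrayDeque<>();
            for (String page : current) {
                List<String> links =
                        linksByPage.getOrDefault(page, Collections.<String>emptyList());
                for (String link : links) {
                    // Stay on the host of the start URL.
                    boolean sameHost = host != null && host.equals(URI.create(link).getHost());
                    // Assumption: the path restriction is a substring match.
                    boolean onPath = followPath == null || link.contains(followPath);
                    if (sameHost && onPath && seen.add(link)) {
                        next.add(link);
                    }
                }
            }
            current = next;
        }
        return seen;
    }
}
```

Calling `CrawlSketch.crawl("http://example.com/", 2, "/tagg/", links)` would return the start URL plus every same-host URL containing `/tagg/` that is reachable within two levels.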

How to crawl

A simple crawl has the following options and prints the crawled URLs to standard output. Note that by default only URLs that return 200 are printed:

usage: CrawlToSystemOut [-l ] [-np ] [-p ] -u  [-v ]
 -l,--level             how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath    no URLs on this path will be crawled [optional]
 -p,--followPath        stay on this path when crawling [optional]
 -u,--url               the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify            verify that all links are returning 200, default is set to true [optional]
 -rh,--requestHeaders   the request headers in the form header1:value1@header2:value2 [optional]
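
The request header format (name and value joined by `:`, pairs joined by `@`) can be illustrated with a small parser. This is a sketch under assumptions: `RequestHeaderParser` is a hypothetical helper, and how the crawler itself parses the option is not shown in this document. Note that with this format a header value cannot contain `@`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the crawler's API.
final class RequestHeaderParser {
    private RequestHeaderParser() {}

    /** Parses "header1:value1@header2:value2" into a map that keeps insertion order. */
    static Map<String, String> parse(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (raw == null || raw.trim().isEmpty()) {
            return headers;
        }
        for (String pair : raw.split("@")) {
            // Split on the first colon only, so values such as URLs may contain colons.
            int colon = pair.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Expected name:value but got: " + pair);
            }
            headers.put(pair.substring(0, colon).trim(), pair.substring(colon + 1).trim());
        }
        return headers;
    }
}
```

For example, `RequestHeaderParser.parse("User-Agent:mybot@Accept:text/html")` yields two entries, `User-Agent` mapped to `mybot` and `Accept` mapped to `text/html`.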

You can choose to output the crawled list to two plain text files, one with working URLs and one with non-working URLs:

usage: CrawlToFile [-ef ] [-f ] [-l ] [-np ] [-p ] -u  [-v ] [-ve ]
 -ef,--errorfilename    the name of the error output file, default name is errorurls.txt [optional]
 -f,--filename          the name of the output file, default name is urls.txt [optional]
 -l,--level             how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath    no URLs on this path will be crawled [optional]
 -p,--followPath        stay on this path when crawling [optional]
 -u,--url               the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify            verify that all links are returning 200, default is set to true [optional]
 -ve,--verbose          verbose logging, default is false [optional]
 -rh,--requestHeaders   the request headers in the form header1:value1@header2:value2 [optional]

You can choose to output the result to a CSV file, with the URLs separated into working and non-working:

usage: CrawlToCsv [-f ] [-l ] [-np ] [-p ] -u  [-v ]
 -f,--filename          the name of the csv output file, default name is result.csv [optional]
 -l,--level             how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath    no URLs on this path will be crawled [optional]
 -p,--followPath        stay on this path when crawling [optional]
 -u,--url               the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify            verify that all links are returning 200, default is set to true [optional]
 -rh,--requestHeaders   the request headers in the form header1:value1@header2:value2 [optional]

Crawl and output only the URLs whose HTML contains a specific keyword:

usage: CrawlToPlainTxtOnlyMatching -k  [-l ] [-np ] [-p ] -u  [-v ]
 -k,--keyword           the keyword to search for in the page [required]
 -l,--level             how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath    no URLs on this path will be crawled [optional]
 -p,--followPath        stay on this path when crawling [optional]
 -u,--url               the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify            verify that all links are returning 200, default is set to true [optional]
 -rh,--requestHeaders   the request headers in the form header1:value1@header2:value2 [optional]

Configuration

Some settings are configured in the crawler.properties file; you can override any of them by setting a system property with the same name. The defaults are:

## Override these properties by setting a system property
com.soulgalore.crawler.nrofhttpthreads=5
com.soulgalore.crawler.threadsinworkingpool=5
com.soulgalore.crawler.http.socket.timeout=5000
com.soulgalore.crawler.http.connection.timeout=5000
# Auth like:
# soulislove.com:80:username:password,...
com.soulgalore.crawler.auth=
# Proxy properties, if you are behind a proxy.
## The host in this special format: http:proxy.soulgalore.com:80
com.soulgalore.crawler.proxy=

The location of the crawler.properties file can be set with the system property com.soulgalore.crawler.propertydir.
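
The override rule (a system property takes precedence over the value in crawler.properties) can be sketched as follows. This is an illustration under assumptions, not the crawler's own code: `CrawlerSettings` is a hypothetical class, and it reads the properties from any `Reader` rather than locating the file itself.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch of "properties file with system property override".
final class CrawlerSettings {
    private final Properties fileDefaults = new Properties();

    CrawlerSettings(Reader propertiesSource) {
        try {
            fileDefaults.load(propertiesSource);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the system property if it is set, else the value from the file, else null. */
    String get(String key) {
        return System.getProperty(key, fileDefaults.getProperty(key));
    }

    /** Convenience accessor for numeric settings such as thread counts and timeouts. */
    int getInt(String key) {
        String value = get(key);
        if (value == null) {
            throw new IllegalStateException("Missing setting: " + key);
        }
        return Integer.parseInt(value.trim());
    }
}
```

With the defaults above loaded, `getInt("com.soulgalore.crawler.nrofhttpthreads")` returns 5, and starting the JVM with `-Dcom.soulgalore.crawler.nrofhttpthreads=10` would make the same call return 10.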

Examples

Check out the project and compile your own full jar (all dependencies included):

git clone git@github.com:soulgalore/crawler.git

or add it to Maven, if you want to include the crawler in your project:

<dependency>
 <groupId>com.soulgalore</groupId>
 <artifactId>crawler</artifactId>
 <version>1.5.11</version>
</dependency>

Examples

Running from the jar, crawling two levels deep and only fetching URLs that contain "/tagg/":

java -jar crawler-1.5.11-full.jar -u http://soulislove.com -l 2 -p /tagg/

Running from the jar, adding basic auth (the auth host should match the host being crawled):

java -Dcom.soulgalore.crawler.auth=soulislove.com:80:peter:secret -jar crawler-1.5.11-full.jar -u http://soulislove.com
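
The auth value follows the `host:port:username:password,...` format from the configuration section. A minimal sketch of how such a value could be split into entries is shown here; `AuthEntry` is a hypothetical class, and the choice to allow colons in the password is an assumption, not documented crawler behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class for one host:port:username:password entry.
final class AuthEntry {
    final String host;
    final int port;
    final String username;
    final String password;

    AuthEntry(String host, int port, String username, String password) {
        this.host = host;
        this.port = port;
        this.username = username;
        this.password = password;
    }

    /** Parses a comma-separated list of host:port:username:password entries. */
    static List<AuthEntry> parseAll(String raw) {
        List<AuthEntry> entries = new ArrayList<>();
        if (raw == null || raw.trim().isEmpty()) {
            return entries; // the default configuration leaves the value empty
        }
        for (String entry : raw.split(",")) {
            // Limit of 4 keeps any further colons inside the password (an assumption).
            String[] parts = entry.trim().split(":", 4);
            if (parts.length != 4) {
                throw new IllegalArgumentException(
                        "Expected host:port:username:password but got: " + entry);
            }
            entries.add(new AuthEntry(parts[0], Integer.parseInt(parts[1]), parts[2], parts[3]));
        }
        return entries;
    }
}
```

For instance, `AuthEntry.parseAll("soulislove.com:80:username:password")` returns a single entry with host `soulislove.com` and port 80.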

Running from the jar, writing the URLs to a CSV file:

java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToCsv -u http://soulislove.com

Running from the jar, writing the URLs to two text files, workingurls.txt and nonworkingurls.txt:

java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToFile -u http://soulislove.com -f workingurls.txt -ef nonworkingurls.txt

Running from the jar, verifying that the assets work:

java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlAndVerifyAssets -u http://www.peterhedenskog.com

License

Copyright 2014 Peter Hedenskog

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.