Web Crawler

Created by Jeremy Wood and Elijah Poulos

Building

The program can be built with Gradle from the terminal. No additional installation is necessary; however, JAVA_HOME must be set correctly in your environment variables.

To build the application, simply run:

gradlew clean build

from a command line. This will create webcrawler.jar in the project root directory.

Testing

The unit and acceptance test suites run as part of the normal build process. However, if you wish to run them separately, simply run:

gradlew clean test

NOTE: Running the test suite launches a web server on the default port of 8081. If this port is not available, the tests will fail. To change the port, add the following system property to the build:

-DtestWebServerPort=<portnum>

Example:

gradlew clean build -DtestWebServerPort=8083

Usage

The Web Crawler requires the following three command-line arguments:

  1. a valid URL

  2. a maximum page depth, given as a natural number.

  3. a path to the local destination directory where the crawler will save downloaded files.

For example:

http://website.com 3 C:/Users/user/Desktop/downloadRepo

Run the program from the command line. If you followed the build instructions above, usage looks like this:

java -jar webcrawler.jar <url> <depth> <destination folder>
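
For instance, using the example arguments above:

java -jar webcrawler.jar http://website.com 3 C:/Users/user/Desktop/downloadRepo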

Logging output

By default, only a few messages are displayed on the console while the crawler is running. If you would like to see more or fewer messages, you can set the log level like so:

java -DlogLevel=<level> -jar webcrawler.jar <url> <depth> <destination folder>

where <level> is one of TRACE, DEBUG, or INFO. TRACE shows the most output, while INFO shows the least.

How it works

The crawler works by parsing the HTML of the page at the given URL for links to other web pages and to files/images, using regular expressions to match <a> and <img> tags and extract their URLs. Each <a> tag is parsed further to determine whether its URL points to another page or to a downloadable file.
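
As a rough illustration of this step (a minimal sketch, not the project's actual code; the pattern strings and the LinkExtractor class name are assumptions), regex-based URL extraction might look like:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; the project's actual patterns and class names may differ.
public class LinkExtractor {

    // Match href="..." inside <a> tags and src="..." inside <img> tags.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s[^>]*href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern IMAGE =
        Pattern.compile("<img\\s[^>]*src=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher a = ANCHOR.matcher(html);
        while (a.find()) {
            urls.add(a.group(1));     // URL from an <a href="...">
        }
        Matcher img = IMAGE.matcher(html);
        while (img.find()) {
            urls.add(img.group(1));   // URL from an <img src="...">
        }
        return urls;
    }
}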

Next, the staged web elements are classified as either "WebPage", "WebImage", or "WebFile". Images and files are added to the DownloadRepository, while links to other pages become the next URLs for the crawler to visit.
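
A hedged sketch of how such a classification could work (the extension lists and names below are assumptions for illustration, not the project's actual rules):

// Illustrative only: classify a URL by its file extension.
enum ElementType { WEB_PAGE, WEB_IMAGE, WEB_FILE }

class Classifier {

    static ElementType classify(String url) {
        String lower = url.toLowerCase();
        if (lower.endsWith(".png") || lower.endsWith(".jpg")
                || lower.endsWith(".jpeg") || lower.endsWith(".gif")) {
            return ElementType.WEB_IMAGE;   // image sources and image links
        }
        if (lower.endsWith(".pdf") || lower.endsWith(".txt") || lower.endsWith(".zip")) {
            return ElementType.WEB_FILE;    // downloadable files
        }
        return ElementType.WEB_PAGE;        // everything else is treated as a page to crawl
    }
}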

The main class keeps track of the current depth; once it matches the desired depth, the crawler stops descending and finishes at the deepest pages.
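
A minimal sketch of depth-limited crawling (the class and helper methods here are hypothetical; the real main class may be organized differently):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CrawlLoop {

    // Illustrative only: crawl one level of pages at a time until maxDepth is reached.
    static void crawl(String startUrl, int maxDepth) {
        Set<String> visited = new HashSet<>();
        List<String> currentLevel = new ArrayList<>();
        currentLevel.add(startUrl);

        for (int depth = 0; depth <= maxDepth && !currentLevel.isEmpty(); depth++) {
            List<String> nextLevel = new ArrayList<>();
            for (String url : currentLevel) {
                if (!visited.add(url)) {
                    continue;                  // skip pages already visited
                }
                String html = fetch(url);      // fetch the page's HTML
                // ... classify extracted URLs and stage images/files for download here ...
                nextLevel.addAll(extractLinks(html));
            }
            currentLevel = nextLevel;          // links found here form the next, deeper level
        }
    }

    static String fetch(String url) {
        return "";                             // placeholder for an HTTP GET
    }

    static List<String> extractLinks(String html) {
        return new ArrayList<>();              // placeholder for the URL-extraction step
    }
}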

After all element URLs have been collected, the files are downloaded to the specified local download path. When the program finishes, the local download repository should contain all images and text files down to the specified depth for the given website. Note that in the current version this program only works as described for static web pages; behavior on dynamic sites such as Facebook and Twitter is undefined.
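
A hedged sketch of the final download step (the Downloader class below is hypothetical; the project's DownloadRepository may behave differently):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class Downloader {

    // Illustrative only: save the content of a URL into the destination directory.
    static void download(String fileUrl, Path destinationDir) throws Exception {
        URL url = new URL(fileUrl);
        String path = url.getPath();
        // Use the last path segment as the file name, falling back to index.html.
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            name = "index.html";
        }
        Files.createDirectories(destinationDir);
        try (InputStream in = url.openStream()) {
            Files.copy(in, destinationDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}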