
Html Analyzer

The project analyzes the HTML document at any given URL. The analysis covers the following points:

  • HTML document type:
    The analyzer looks for the document type declaration and matches it against a set of predefined regular expressions. HTML5, XHTML and HTML4 are supported.

  • Page Title:
    The analyzer tries to find the page title.

  • Heading level occurrences:
    The analyzer counts the occurrences of headings at each level.

  • Hypermedia links:
    The analyzer counts internal and external hypermedia links. The a[href], link[href], area[href], img[src] and script[src] selectors are supported.

  • Login form detection:
    The analyzer tries to determine whether the page contains a login form. Two rules are defined for this; in both cases the page must contain at least one form element:

    • If the form contains exactly one password field, it is treated as a login form.
    • If all of the following rules apply, the form is treated as a two-step login form:
      • The form method is POST.
      • The action contains a keyword defined in a dictionary. (The dictionary file can be found under the resources folder.)
      • The form contains exactly one text field, and its name (or its id, if no name is defined) contains a keyword defined in a dictionary. (The dictionary file can be found under the resources folder.)
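The document type matching described above can be sketched with plain java.util.regex. The class name, the patterns and the "UNKNOWN" fallback below are illustrative assumptions, not the project's actual expressions:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: classify a document by its DOCTYPE declaration.
class DoctypeSketch {
    // Insertion order matters: the patterns are tried from top to bottom.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();

    static {
        int flags = Pattern.CASE_INSENSITIVE;
        // Assumed patterns; the real project may use different expressions.
        PATTERNS.put("XHTML", Pattern.compile(
                "<!DOCTYPE\\s+html\\s+PUBLIC\\s+\"-//W3C//DTD XHTML[^\"]*\"", flags));
        PATTERNS.put("HTML4", Pattern.compile(
                "<!DOCTYPE\\s+html\\s+PUBLIC\\s+\"-//W3C//DTD HTML 4[^\"]*\"", flags));
        PATTERNS.put("HTML5", Pattern.compile("<!DOCTYPE\\s+html\\s*>", flags));
    }

    // Returns the name of the first matching type, or "UNKNOWN" if none matches.
    static String detect(String html) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(html).find()) {
                return entry.getKey();
            }
        }
        return "UNKNOWN";
    }
}
```

Calling DoctypeSketch.detect with the raw page source returns one of the three supported type names, or the fallback value when no declaration is recognized.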
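One possible way to separate internal from external links is to resolve each link against the page URL and compare hosts. This is a minimal sketch using only java.net.URI; the class and method names are hypothetical, and treating links without a host (such as mailto:) as not internal is an assumption rather than the project's documented behavior:

```java
import java.net.URI;

// Hypothetical sketch: decide whether a link points to the same host as the page.
class LinkScopeSketch {
    // pageUrl is assumed to be an absolute URL; link may be relative or absolute.
    static boolean isInternal(String pageUrl, String link) {
        URI base = URI.create(pageUrl);
        // Relative links are resolved against the page URL; absolute links are kept as-is.
        URI resolved = base.resolve(link);
        String host = resolved.getHost();
        // Exact host comparison only: subdomains count as external in this sketch.
        return host != null && host.equalsIgnoreCase(base.getHost());
    }
}
```

Because the comparison is an exact host match, a link to a subdomain of the analyzed page is counted as external here.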
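The two login form rules above can be sketched as follows. To keep the example free of dependencies, the Form and Input classes are hypothetical stand-ins for parsed elements, and the keyword lists are invented placeholders for the dictionary files in the resources folder:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the login form rules; not the project's actual classes.
class LoginFormSketch {
    /** Stand-in for a parsed input element. */
    static final class Input {
        final String type;
        final String name;
        final String id;

        Input(String type, String name, String id) {
            this.type = type;
            this.name = name;
            this.id = id;
        }
    }

    /** Stand-in for a parsed form element. */
    static final class Form {
        final String method;
        final String action;
        final List<Input> inputs;

        Form(String method, String action, Input... inputs) {
            this.method = method;
            this.action = action;
            this.inputs = Arrays.asList(inputs);
        }
    }

    // Assumed keywords; the real ones live in the project's dictionary files.
    static final List<String> ACTION_KEYWORDS = Arrays.asList("login", "signin", "auth");
    static final List<String> FIELD_KEYWORDS = Arrays.asList("user", "email", "login");

    // Rule 1: exactly one password field.
    static boolean isLoginForm(Form form) {
        return inputsOfType(form, "password").size() == 1;
    }

    // Rule 2: POST method, keyword in the action, exactly one matching text field.
    static boolean isTwoStepLoginForm(Form form) {
        if (!"post".equalsIgnoreCase(form.method)) return false;
        if (!containsKeyword(form.action, ACTION_KEYWORDS)) return false;
        List<Input> textFields = inputsOfType(form, "text");
        if (textFields.size() != 1) return false;
        Input field = textFields.get(0);
        // Fall back to the id when no name is defined.
        String label = (field.name != null && !field.name.isEmpty()) ? field.name : field.id;
        return containsKeyword(label, FIELD_KEYWORDS);
    }

    private static List<Input> inputsOfType(Form form, String type) {
        List<Input> result = new ArrayList<>();
        for (Input input : form.inputs) {
            if (type.equalsIgnoreCase(input.type)) result.add(input);
        }
        return result;
    }

    private static boolean containsKeyword(String value, List<String> keywords) {
        if (value == null) return false;
        String lower = value.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (lower.contains(keyword)) return true;
        }
        return false;
    }
}
```

In this sketch only inputs with type text count as text fields; whether the analyzer also accepts other input types is not stated above.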

How to run

Run any of the following commands in the project directory.

  • Maven:
mvn clean compile exec:java
  • Docker:
# Build the docker image
docker build -t htmlanalyzer .
# Run the image
docker run -p 4567:4567 htmlanalyzer:latest
  • Java command:
mvn clean package
java -jar ./target/htmlanalyzer-jar-with-dependencies.jar

The web server listens on port 4567. Please open the following URL in your browser.


How to test

Run the following command in the project directory.

mvn clean test

Please see the code coverage here.


  • HTML analysis can be improved by rendering JavaScript.
  • Document type detection can be improved by supporting more types.
  • Hypermedia detection can be improved by adding detection for embedded object and applet types.
  • Internal hypermedia detection can be improved by detecting subdomains.
  • The two-step login check can be improved by also accepting forms with two text input fields whose names both appear in the dictionary (e.g. a form with username and email fields).
  • A configuration mechanism can be introduced to store the HTTP listen port, dictionary file locations, logging configuration, HTTP URL connection settings, etc.
  • Test coverage can be improved to cover all of the source code.


  • Java 8
  • jsoup (1.10.1) library (for HTML parsing)
  • sparkjava (2.5.2) library (for simple web application)
  • apache commons-io (2.5)
  • junit (4.12), mockito (2.0.42-beta), powermock (1.6.6)
  • Bootstrap Framework (3.3.7) & jQuery (3.1.1)