WebCrawler - Group 12

About / Usage

This Java application is a web crawler that traverses websites starting from one or more user-provided URLs and explores them up to a specified crawl depth.
The application takes the following inputs to control its crawling and processing behavior:
- URL:
  - Description: The starting URLs from which the crawler will begin its operation. Up to 10 URLs can be crawled at the same time.
  - Command: Enter the URLs when prompted: "Please enter the URLs (comma-separated, no spaces)"
  - Example: https://example.com,https://www.example2.at
- Crawl Depth:
  - Description: The maximum depth the crawler will traverse from each starting URL.
  - Command: Specify the depth when prompted: "Please enter the Crawl depth"
  - Example: 3
- Domains to be Crawled:
  - Description: Limits the crawler to specific domains so that it does not wander off to unwanted areas of the internet. If left blank, the crawler accepts any domain.
  - Command: Enter the domains when prompted: "Please enter the domains to be crawled (comma-separated, no spaces):"
  - Example: example.com,sub.example.com
- Additional links depth (Optional):
  - Description: The maximum depth the crawler will traverse for additionally found URLs (links to other websites).
  - Command: Specify the depth when prompted: "Define the depth for additional links"
  - Example: 2 | Default: 2
- Path for the MD file (Optional):
  - Description: The directory in which the crawler stores the summary file. If no path is given, the results are stored in a temp directory.
  - Command: Enter the path when prompted: "Enter the path where the .md File should be stored. Will be stored under temp as per default:"
  - Example: C:\Users\User\AppData\Local\Temp\tempFolder3885246162413379470\
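
The inputs described above could be parsed and validated roughly as follows. This is a minimal sketch, not the project's actual code: the class `CrawlInput` and all of its method names are hypothetical, and treating subdomains of a listed domain as allowed is an assumption that the README does not confirm. Only the limit of 10 URLs, the default additional depth of 2, and the "blank means any domain" rule come from the descriptions above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; names are not taken from the project.
final class CrawlInput {
    static final int MAX_URLS = 10;                // limit stated in the README
    static final int DEFAULT_ADDITIONAL_DEPTH = 2; // default stated in the README

    // Splits a comma-separated line, trimming entries and dropping empty ones.
    static List<String> splitCommaSeparated(String line) {
        List<String> out = new ArrayList<>();
        if (line == null) {
            return out;
        }
        for (String part : line.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Parses the start URLs and enforces the 1..MAX_URLS range.
    static List<String> parseUrls(String line) {
        List<String> urls = splitCommaSeparated(line);
        if (urls.isEmpty() || urls.size() > MAX_URLS) {
            throw new IllegalArgumentException("Expected 1 to " + MAX_URLS + " URLs");
        }
        return urls;
    }

    // Parses a non-negative depth; blank input falls back to the default.
    static int parseDepth(String line, int defaultValue) {
        if (line == null || line.trim().isEmpty()) {
            return defaultValue;
        }
        int depth = Integer.parseInt(line.trim());
        if (depth < 0) {
            throw new IllegalArgumentException("Depth must not be negative");
        }
        return depth;
    }

    // An empty domain list accepts every URL. Otherwise the host must equal a
    // listed domain or be a subdomain of it (the subdomain rule is an assumption).
    static boolean isAllowed(String url, List<String> domains) {
        if (domains.isEmpty()) {
            return true;
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        String lowerHost = host.toLowerCase();
        for (String domain : domains) {
            String lowerDomain = domain.toLowerCase();
            if (lowerHost.equals(lowerDomain) || lowerHost.endsWith("." + lowerDomain)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `CrawlInput.parseDepth("", CrawlInput.DEFAULT_ADDITIONAL_DEPTH)` falls back to 2 when the optional prompt is left blank, and `CrawlInput.isAllowed(url, domains)` can be checked before each link is followed.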

Run / Setup

Which website can you use?

Test website: https://webscraper.io/test-sites/e-commerce/allinone
Arguments: arg1=URL; arg2=Depth; arg3=Domains; arg4=Additional Links Depth; arg5=Path

Through IntelliJ

IntelliJ IDEA is an integrated development environment for Java, Kotlin, Groovy, and other JVM-based languages.
Once the project is open, locate the src | main | java | Main.java file in the Project tool window (Alt+1) and open it in the editor. In the editor, click the gutter icon next to the main method and select Run 'Main.main()'. IntelliJ IDEA runs the code, and the Run tool window opens at the bottom of the screen.

Features

The crawler implements the following features:

- input the URL, the depth of websites to crawl, and the domain(s) of websites to be crawled
- create a compact overview of the crawled websites
- record only the headings
- represent the depth of the crawled websites with proper indentation
- record the URLs of the crawled sites
- highlight broken links
- find the links to other websites and recursively analyze those websites (by default to a depth of 2 without visiting further links; this depth can be configured at the prompt)
- store the results in a single markdown file (.md extension)
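
How the overview features might fit together can be sketched as follows. The output layout here is an assumption, since the README does not show the real report format: the class `MarkdownSketch`, the `Page` type, the two-space indentation per depth level, and the bold "broken link" marker are all hypothetical.

```java
import java.util.List;

// Hypothetical renderer for illustration; the project's real format may differ.
final class MarkdownSketch {

    // One crawled page: its URL, crawl depth, reachability, and headings.
    static final class Page {
        final String url;
        final int depth;
        final boolean broken;
        final List<String> headings;

        Page(String url, int depth, boolean broken, List<String> headings) {
            this.url = url;
            this.depth = depth;
            this.broken = broken;
            this.headings = headings;
        }
    }

    // Renders each page as a list item indented by two spaces per depth level.
    // Headings are nested one level below their page; broken links are bolded.
    static String render(List<Page> pages) {
        StringBuilder sb = new StringBuilder();
        for (Page page : pages) {
            String indent = "  ".repeat(page.depth);
            if (page.broken) {
                sb.append(indent).append("- **broken link:** ").append(page.url).append('\n');
                continue;
            }
            sb.append(indent).append("- ").append(page.url).append('\n');
            for (String heading : page.headings) {
                sb.append(indent).append("  - ").append(heading).append('\n');
            }
        }
        return sb.toString();
    }
}
```

The resulting string could then be written to the .md file, for instance with `java.nio.file.Files.writeString`.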
