This Java application is a web crawler. It starts from one or more user-provided URLs and follows links up to a specified crawl depth.
The application takes the following inputs, either at the interactive prompts or as command-line arguments:
- URL:
  - Description: The starting URLs from which the crawler begins. Up to 10 URLs can be crawled at the same time.
  - Command: Enter the URLs when prompted: "Please enter the URLs (comma-separated, no spaces)"
  - Example: https://example.com,https://www.example2.at
- Crawl Depth:
  - Description: The maximum depth the crawler will traverse from each starting URL.
  - Command: Specify the depth when prompted: "Please enter the Crawl depth"
  - Example: 3
- Domains to be Crawled:
  - Description: Restricts the crawler to the listed domains so that it does not follow links to unrelated sites. If left blank, the crawler accepts any domain.
  - Command: Enter the domains when prompted: "Please enter the domains to be crawled (comma-separated, no spaces):"
  - Example: example.com,sub.example.com
- Additional links depth (Optional):
  - Description: The maximum depth the crawler will traverse for additional URLs found during the crawl.
  - Command: Specify the depth when prompted: "Define the depth for additional links"
  - Example: 2 | Default: 2
- Path for the MD file (Optional):
  - Description: The directory in which the crawler stores the summary file. If no path is given, the results are stored in a temp directory.
  - Command: Enter the path when prompted: "Enter the path where the .md File should be stored. Will be stored under temp as per default:"
  - Example: C:\Users\User\AppData\Local\Temp\tempFolder3885246162413379470\
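The domain restriction described in the list above could be sketched roughly as follows. This is only an illustration: the class name `DomainFilter` and the method `accepts` are hypothetical, and the sketch assumes exact host matching (which would explain why the example lists `sub.example.com` separately). The crawler's real matching rule may differ.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; the project's own filtering code is not shown in this README. */
final class DomainFilter {
    // Lower-case host names; an empty set means "accept any domain".
    private final Set<String> allowedHosts;

    DomainFilter(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    /** Returns true if the URL's host is allowed. Assumption: exact host matching. */
    boolean accepts(String url) {
        if (allowedHosts.isEmpty()) {
            return true; // blank domain list: no restriction
        }
        try {
            String host = new URI(url).getHost();
            return host != null && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false; // URLs that cannot be parsed are rejected
        }
    }
}
```

With the example input `example.com,sub.example.com`, this sketch would accept `https://sub.example.com/x` and reject `https://other.org/`.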
Which website can you use?
Test website: https://webscraper.io/test-sites/e-commerce/allinone
Arguments: arg1=URL; arg2=Depth; arg3=Domains; arg4=Additional Links Depth; arg5=Path
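As a rough sketch of how the five positional arguments might be read, the following maps `args[0]` to `args[4]` onto a small configuration object. The class `CrawlConfig` and its field names are hypothetical and are not taken from the project. Using the system temp directory as the fallback output location is also an assumption. The sketch needs Java 9 or newer.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical holder for the five arguments; not the project's actual class. */
final class CrawlConfig {
    static final int MAX_URLS = 10;                // limit stated in the argument list
    static final int DEFAULT_ADDITIONAL_DEPTH = 2; // default stated in the argument list

    final List<String> urls;
    final int depth;
    final List<String> domains; // empty = accept any domain
    final int additionalDepth;
    final Path outputDir;

    private CrawlConfig(List<String> urls, int depth, List<String> domains,
                        int additionalDepth, Path outputDir) {
        this.urls = urls;
        this.depth = depth;
        this.domains = domains;
        this.additionalDepth = additionalDepth;
        this.outputDir = outputDir;
    }

    /** Parses arg1..arg5; arg3 to arg5 are treated as optional in this sketch. */
    static CrawlConfig fromArgs(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                    "Usage: <urls> <depth> [domains] [additionalLinksDepth] [path]");
        }
        List<String> urls = splitCsv(args[0]);
        if (urls.isEmpty() || urls.size() > MAX_URLS) {
            throw new IllegalArgumentException("Expected 1 to " + MAX_URLS + " URLs");
        }
        int depth = Integer.parseInt(args[1]);
        List<String> domains = args.length > 2 ? splitCsv(args[2]) : List.<String>of();
        int additionalDepth =
                args.length > 3 ? Integer.parseInt(args[3]) : DEFAULT_ADDITIONAL_DEPTH;
        // Assumption: the system temp directory stands in for the crawler's temp folder.
        Path outputDir =
                Paths.get(args.length > 4 ? args[4] : System.getProperty("java.io.tmpdir"));
        return new CrawlConfig(urls, depth, domains, additionalDepth, outputDir);
    }

    /** Splits a comma-separated value and drops empty entries. */
    private static List<String> splitCsv(String value) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }
}
```

A call such as `CrawlConfig.fromArgs(new String[] {"https://example.com", "3"})` would leave the domain list empty and fall back to an additional links depth of 2.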
Through IntelliJ
IntelliJ IDEA is an integrated development environment (IDE), written in Java, for developing software in Java, Kotlin, Groovy, and other JVM-based languages.
Once the project is open, locate the src | main | java | Main.java file in the Project tool window (Alt+1) and open it in the editor. In the editor, click the Run icon in the gutter and select Run 'Main.main()'. IntelliJ IDEA runs your code, and the Run tool window opens at the bottom of the screen.
Read more: IntelliJ Docs
Command Prompt (Windows)
- Compile the Java code: First, compile the source to create the .class files. From the project root, with the source files under src/main/java, compile Main.java like this:
javac src/main/java/crawler/Main.java
- Run the compiled Java program: Set the classpath to the root directory of the class files and give the fully qualified name of the main class. From the project root, with the class files under src/main/java, run the program with:
java -cp src/main/java crawler.Main
- Running with Arguments:
java -cp src/main/java crawler.Main arg1 arg2 arg3 arg4
Read more: WikiHow
Terminal (macOS)
- Open the Terminal
- Navigate to the Project Directory: use the cd (change directory) command to move to the project root. For example, if your project is located in your Documents folder, you might use a command like:
cd ~/Documents/WebCrawler
- Compile the Java Code
javac src/main/java/crawler/Main.java
- Run the Compiled Java Program
java -cp src/main/java crawler.Main
- Passing Arguments
java -cp src/main/java crawler.Main arg1 arg2 arg3 arg4
Read more: Stack Overflow answer
The crawler implements the following features:
- input the URL, the depth of websites to crawl, and the domain(s) of websites to be crawled
- create a compact overview of the crawled websites
- record only the headings
- represent the depth of the crawled websites with proper indentation (see example)
- record the URLs of the crawled sites
- highlight broken links
- find the links to other websites and recursively analyze those websites (to a depth of 2 by default; the depth can be configured via the command line)
- store the results in a single markdown file (.md extension)
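To illustrate how several of these features fit together, here is a minimal sketch of a depth-limited, depth-first traversal that produces an indented Markdown overview. It works on an in-memory map instead of real HTTP requests so that it stays self-contained. The names `ReportSketch`, `Page`, and `render` are hypothetical. The output layout (two spaces per level, a "heading:" prefix, bold "broken link" marker) is assumed for illustration and is not the crawler's actual format. A URL missing from the map stands in for a broken link. The sketch needs Java 11 or newer for `String.repeat`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not the project's actual report code. */
final class ReportSketch {
    /** Hypothetical page model: the headings on a page plus its outgoing links. */
    static final class Page {
        final List<String> headings;
        final List<String> links;

        Page(List<String> headings, List<String> links) {
            this.headings = headings;
            this.links = links;
        }
    }

    /** Renders an indented Markdown list of the pages reachable within maxDepth. */
    static String render(Map<String, Page> site, String startUrl, int maxDepth) {
        StringBuilder out = new StringBuilder();
        visit(site, startUrl, 0, maxDepth, new HashSet<>(), out);
        return out.toString();
    }

    private static void visit(Map<String, Page> site, String url, int depth, int maxDepth,
                              Set<String> seen, StringBuilder out) {
        // Stop below the depth limit and skip URLs that were already reported.
        if (depth > maxDepth || !seen.add(url)) {
            return;
        }
        String indent = "  ".repeat(depth); // two spaces per crawl level
        Page page = site.get(url);
        if (page == null) {
            // Assumption: a URL absent from the map models a broken link.
            out.append(indent).append("- **broken link:** ").append(url).append('\n');
            return;
        }
        out.append(indent).append("- ").append(url).append('\n');
        for (String heading : page.headings) {
            out.append(indent).append("  - heading: ").append(heading).append('\n');
        }
        for (String link : page.links) {
            visit(site, link, depth + 1, maxDepth, seen, out);
        }
    }
}
```

The returned string could then be written to a single .md file, for example with `java.nio.file.Files.writeString`.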