WebCrawler: A Java repository from itsttk

This project crawls a website and extracts a list of its most relevant topics.


Classes:
The project contains three classes:

1.	TextFormat.java: Contains the main method and other methods that check whether a URL is valid. If the URL is valid, it uses the other classes to perform scraping.
2.	TextFormat.java: Contains methods that clean the text, such as removing stop words and special characters. Once processing is done, it uses the Format_helper.java class to extract common topics.
3.	Format_helper.java: Uses a HashMap to store keywords and their frequencies. The entries are sorted in descending order by value in each <key, value> pair, and the top words (the most common topics in the document at the URL) are returned and printed to the console.
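
The steps described above (URL check, text cleaning, frequency counting, sorting) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name TopicSketch, the method names isValidUrl, clean and topWords, the tiny stop-word set, and the alphabetical tie-break are all assumptions made for the example. It works on a plain string and does not fetch any page.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and structure are assumptions, not the repository's code.
class TopicSketch {
    // Tiny illustrative stop-word set; a real run would load a much longer list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "is", "of", "to", "in"));

    // Returns true only for syntactically valid absolute http/https URLs.
    static boolean isValidUrl(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (Exception e) {
            return false; // malformed or null input
        }
    }

    // Lowercases the text, replaces non-letter runs with spaces, and drops stop words.
    static List<String> clean(String text) {
        List<String> words = new ArrayList<>();
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z]+", " ").trim();
        for (String w : normalized.split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }

    // Counts word frequencies in a HashMap and returns the n most frequent words.
    static List<String> topWords(String text, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : clean(text)) {
            freq.merge(w, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        // Descending by count; ties broken alphabetically so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        String text = "The cat and the dog. The cat is in the hat!";
        System.out.println(isValidUrl("https://example.com/page")); // prints true
        System.out.println(topWords(text, 2));                      // prints [cat, dog]
    }
}
```

Sorting a copy of the entry set keeps the HashMap itself untouched, since a HashMap has no defined ordering of its own.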

Stop Word References:
1. https://algs4.cs.princeton.edu/35applications/stopwords.txt
2. https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt


External Libraries:
This project uses the jsoup Java HTML parser (jsoup-1.11.3.jar) to perform web scraping.