A web search engine developed in Java, with the web crawler written in Python 3.
A simple search engine that ranks text files by the frequency of the query keywords.
Project Components:
--> Imported Packages: Text Processing, Sorting
--> Python Web crawler: web-crawler.py
--> Text Files: websites.txt, stop-words.txt
--> Folders: hashmap_data, urls
--> Java File: URLtoText.java - Code to parse URLs into text files.
--> Java File: SearchEngine.java - Driver code along with helper functions.
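The Sorting package above presumably provides the merge sort used for page ranking. As a hedged sketch (not the project's actual code), a merge sort over frequency counts in decreasing order could look like this:

```java
import java.util.Arrays;

public class MergeSortSketch {
    // Sort 'a' in decreasing order (page-ranking order) using merge sort.
    static void sort(int[] a) {
        if (a.length < 2) return;
        int mid = a.length / 2;
        int[] left = Arrays.copyOfRange(a, 0, mid);
        int[] right = Arrays.copyOfRange(a, mid, a.length);
        sort(left);
        sort(right);
        merge(a, left, right);
    }

    static void merge(int[] a, int[] l, int[] r) {
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length)
            a[k++] = (l[i] >= r[j]) ? l[i++] : r[j++]; // '>=' keeps the order decreasing
        while (i < l.length) a[k++] = l[i++];
        while (j < r.length) a[k++] = r[j++];
    }

    public static void main(String[] args) {
        int[] freqs = {3, 17, 5, 9, 1};
        sort(freqs);
        System.out.println(Arrays.toString(freqs)); // prints [17, 9, 5, 3, 1]
    }
}
```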
Concepts Used:
- Sorting (Merge Sort)
- Ternary Search Trie
- Hash Maps
- Text Processing (JSoup, String Functions)
- Memory Management (Caching)
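The Ternary Search Trie is the core structure for counting keyword frequencies. A minimal sketch of a TST mapping words to occurrence counts (assumed structure, not the project's actual implementation) looks like this:

```java
public class TST {
    private Node root;

    private static class Node {
        char c;
        Node left, mid, right;
        Integer freq; // keyword frequency; null if this node ends no word
    }

    // Increment the count for 'key' (called once per occurrence in a text file).
    public void increment(String key) {
        root = put(root, key, 0);
    }

    private Node put(Node x, String key, int d) {
        char c = key.charAt(d);
        if (x == null) { x = new Node(); x.c = c; }
        if      (c < x.c)                x.left  = put(x.left, key, d);
        else if (c > x.c)                x.right = put(x.right, key, d);
        else if (d < key.length() - 1)   x.mid   = put(x.mid, key, d + 1);
        else x.freq = (x.freq == null) ? 1 : x.freq + 1;
        return x;
    }

    // Frequency of 'key', or 0 if it never appeared.
    public int get(String key) {
        Node x = get(root, key, 0);
        return (x == null || x.freq == null) ? 0 : x.freq;
    }

    private Node get(Node x, String key, int d) {
        if (x == null) return null;
        char c = key.charAt(d);
        if      (c < x.c)              return get(x.left, key, d);
        else if (c > x.c)              return get(x.right, key, d);
        else if (d < key.length() - 1) return get(x.mid, key, d + 1);
        else return x;
    }
}
```

A TST gives prefix-aware lookups with better space behavior than one hash map per file, which is likely why it was chosen over a plain `HashMap<String, Integer>` per document.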
Flow of Execution of the Search Engine:
- The Python web crawler crawls the web and recursively retrieves around 1500 URLs.
- Each URL is parsed to a text file using JSoup.
- Stop words are removed from the search string given by the user.
- The search string is split into tokens using Java's StringTokenizer.
- All URLs are indexed into a Hash Map.
- A TST is generated for each text file, and the frequencies of the keywords are extracted.
- To implement page ranking, these frequencies are stored in a Hash Map along with each URL index.
- The page-ranking Hash Map is sorted in decreasing order of keyword frequency.
- The page-ranking Hash Map is kept in memory as a cache, drastically improving search time on repeated queries.
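The tokenization and ranking steps above can be sketched as follows. The stop-word list and the URL-to-frequency numbers here are made-up placeholders (the real ones come from stop-words.txt and the per-file TSTs), and the standard library sort stands in for the project's merge sort:

```java
import java.util.*;

public class SearchFlowSketch {
    // Split the search string with StringTokenizer and drop stop words.
    static List<String> tokenize(String query, Set<String> stopWords) {
        List<String> keywords = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(query);
        while (st.hasMoreTokens()) {
            String token = st.nextToken().toLowerCase();
            if (!stopWords.contains(token)) keywords.add(token);
        }
        return keywords;
    }

    // Sort URL indices in decreasing order of total keyword frequency.
    static List<Integer> rank(Map<Integer, Integer> urlToFrequency) {
        List<Map.Entry<Integer, Integer>> entries = new ArrayList<>(urlToFrequency.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue());
        List<Integer> order = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : entries) order.add(e.getKey());
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical stop words; the real list lives in stop-words.txt.
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of", "in", "a"));
        System.out.println(tokenize("the history of Java in a nutshell", stop));

        // Hypothetical URL-index -> frequency map; in the project these counts
        // come from the TST built over each parsed text file.
        Map<Integer, Integer> freq = new HashMap<>();
        freq.put(0, 4);
        freq.put(1, 11);
        freq.put(2, 7);
        System.out.println(rank(freq)); // highest-frequency URL index first
    }
}
```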