Folder structure: The src folder has the solr directory. In the solr directory,
- indexer folder contains the script to index the dataset into the Solr server.
- webcrawler folder contains the script to run the web crawler to create the dataset.
For the UI,
- user-interface folder contains the UI, which connects to the Solr server.
- Download the solr folder, the user-interface folder & the resource folder. Create the travelData folder in the resource folder.
- Apache Solr must be installed on the system as a prerequisite.
- Run the Solr server at "http://localhost:8983/" and create the core "travelandeat".
- When creating the core, set the instanceDir option to the path of the local solr folder so that the data folder is created there.
- Run the WebSpider script to crawl and create the dataset, which is in XML format.
- Run the CollectionIndexer script to index the dataset.
- Open the Query option in the Solr admin UI and click Execute Query to check that the dataset has been indexed.
- To run the user interface at "http://localhost:3000/", first run yarn to install all the dependencies, then run yarn start to start the UI.
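
The check on the indexed dataset can also be done outside the admin UI. The following is a minimal sketch, assuming Java 11 or later and Solr's standard select endpoint; the class name SolrIndexCheck and the helper selectAllUri are hypothetical and not part of this project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: asks the core for all documents but requests zero rows,
// so the response only carries the match count (the numFound field).
final class SolrIndexCheck {

    // Builds the match-all query URI for a core, e.g. "travelandeat".
    static URI selectAllUri(String baseUrl, String core) {
        return URI.create(baseUrl + "/solr/" + core + "/select?q=*:*&rows=0");
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(selectAllUri("http://localhost:8983", "travelandeat"))
                .GET()
                .build();
        // Requires the Solr server from the steps above to be running.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

A numFound value greater than zero in the printed response suggests that the dataset was indexed.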
Please note: in some results the titles can be similar, but they have different URLs that lead to different webpages.
- Indexing
- Inverted Index
- Ranking documents (using cosine similarity)
- Download the zip file or clone the repository.
- Run the Main.java class.
Note: Remove the entries from the index.properties file before testing the functionality afresh, because each run appends to the existing entries.
Each entry in the inverted index is represented by a tuple.
The tuple holds the term (which is treated as the key), the frequency of the term across all the documents, and a list of postings.
A posting is another data structure, which holds the documentId of a document in which the term has occurred.
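
A minimal Java sketch of the tuple and posting described above. The class, field, and method names (IndexTypesSketch, IndexTuple, Posting, addOccurrence) are illustrative assumptions and are not taken from the project's source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical container for the two types; names are assumptions.
final class IndexTypesSketch {

    // A posting records one document in which the term occurred.
    static final class Posting {
        final int documentId;

        Posting(int documentId) {
            this.documentId = documentId;
        }
    }

    // A tuple holds the term (the key), its frequency, and its postings.
    static final class IndexTuple {
        final String term;
        int frequency;
        final List<Posting> postings = new ArrayList<>();

        IndexTuple(String term) {
            this.term = term;
        }

        // Registers one more occurrence of the term in the given document.
        void addOccurrence(int documentId) {
            postings.add(new Posting(documentId));
            frequency++;
        }
    }
}
```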
First, the contents of each document are traversed and converted into tokens.
During tokenizing, special characters, numbers, and stop words are removed from the contents.
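
The tokenizing step could look roughly like the sketch below. The stop-word list is a small placeholder, and lowercasing and splitting on whitespace are assumptions; the project may use a different list and different rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical tokenizer; the class name and stop-word list are assumptions.
final class TokenizerSketch {

    // Placeholder stop words, not the project's actual list.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "is", "of", "to", "in");

    static List<String> tokenize(String content) {
        // Lowercase (an assumption), then replace everything that is not a
        // letter or whitespace with a space; this drops special characters
        // and numbers.
        String lettersOnly = content.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z\\s]", " ");
        List<String> tokens = new ArrayList<>();
        for (String word : lettersOnly.trim().split("\\s+")) {
            // Skip the empty string produced by blank input, and stop words.
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

For example, tokenize("The 3 best cafes, in Paris!") yields the tokens best, cafes, and paris.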
After tokenizing is done, all the tokens are indexed into a hashmap.
If the hashmap already contains a **tuple** for the term, the new posting is added to the existing posting list of that term and the frequency is incremented by 1.
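
The insert-or-update step can be sketched as follows. This is an illustration under assumptions: Entry is a simplified stand-in for the tuple, postings are stored as plain document ids, and the names IndexBuilderSketch and addToken are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index builder; names are assumptions.
final class IndexBuilderSketch {

    // Simplified stand-in for the tuple: frequency plus posting list.
    static final class Entry {
        int frequency;
        final List<Integer> postings = new ArrayList<>();
    }

    // Adds one token occurrence. A new term gets a fresh entry; an existing
    // term gets the posting appended and its frequency incremented by 1.
    static void addToken(Map<String, Entry> index, String term, int documentId) {
        Entry entry = index.computeIfAbsent(term, t -> new Entry());
        entry.postings.add(documentId);
        entry.frequency++;
    }

    public static void main(String[] args) {
        Map<String, Entry> index = new HashMap<>();
        addToken(index, "paris", 1);
        addToken(index, "paris", 2);
        System.out.println(index.get("paris").frequency);
    }
}
```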
After the hashmap is created, its entries are sorted and stored in a LinkedHashMap, which preserves the sorted order, before the index is written to disk.
Finally, the sorted hashmap is written into the "index.properties" file.
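
A sketch of the sort-and-write step, under stated assumptions: the entries are sorted alphabetically by term, each line has the layout term=docId,docId, and the file is opened in append mode. The sort key and the line layout are not specified here, so both are illustrative, as are the names IndexWriterSketch, sortByTerm, toLines, and appendTo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical writer; names, sort key, and line layout are assumptions.
final class IndexWriterSketch {

    // Copies the entries into a LinkedHashMap in alphabetical term order.
    // The LinkedHashMap itself does not sort; it keeps insertion order.
    static LinkedHashMap<String, List<Integer>> sortByTerm(Map<String, List<Integer>> index) {
        LinkedHashMap<String, List<Integer>> sorted = new LinkedHashMap<>();
        for (String term : new TreeSet<>(index.keySet())) {
            sorted.put(term, index.get(term));
        }
        return sorted;
    }

    // Renders one "term=docId,docId" line per entry (assumed layout).
    static List<String> toLines(Map<String, List<Integer>> sorted) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : sorted.entrySet()) {
            StringBuilder ids = new StringBuilder();
            for (int id : entry.getValue()) {
                if (ids.length() > 0) {
                    ids.append(',');
                }
                ids.append(id);
            }
            lines.add(entry.getKey() + "=" + ids);
        }
        return lines;
    }

    // Appends the lines to the file, creating it if it does not exist.
    static void appendTo(Path file, Map<String, List<Integer>> sorted) throws IOException {
        Files.write(file, toLines(sorted), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Opening the file in append mode is why entries from earlier runs stay in the file until they are removed by hand.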
Please note, there is no explicit capacity check when adding the tuples, because Java's HashMap grows its internal table automatically as entries are added. This does not protect against running out of JVM heap memory on very large collections.
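
Once the index exists, the cosine-similarity ranking listed among the features can be illustrated with the sketch below. It assumes raw term counts as vector weights; the project may weight terms differently (for example with tf-idf), and the names CosineSimilaritySketch, cosine, and norm are hypothetical.

```java
import java.util.Map;

// Hypothetical scorer; names and the raw-count weighting are assumptions.
final class CosineSimilaritySketch {

    // Cosine of the angle between two sparse term-count vectors:
    // dot(a, b) / (|a| * |b|), or 0 when either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a term-count vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Documents would then be ranked by computing cosine(query, document) for each document and sorting the scores in descending order.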