aasnani/CS532-SearchEngine

Project 1 - Systems for Data Science

Requirements:

Java 1.8 (IntelliJ)
Maven (Enable Auto-Import of Maven projects)
- org.apache.spark:spark-core_2.10:2.1.0
- org.rocksdb:rocksdbjni:5.15.10
- org.apache.hadoop:hadoop-common:3.1.1
Apache Spark
Apache Hadoop

Included in Project:

All source files
Id_Urls.txt - File containing document ID and associated URL
Web Page Document Files - inside the projectdata folder

To Run:

Open Project in IntelliJ
Enable Auto-Import from Maven, when prompt appears in bottom right corner.
Run class titled InvertedIndex. All code is stored in this class file.
Send any HTTP request method to: localhost:8843/search with space separated search terms (in the body in raw format, if using Postman to send request)

Text processing done for the assignment:

All text from document files is converted to lowercase and separated into individual tokens by whitespace only. Punctuation was not removed or separated. (ex. diatetes and obesity,)
Queries received from client are also converted to lower case before returning search results, this is done to match the lowercase text tokens.
All duplicate URLs are removed from search results.

Testing:

We tested this search engine using Postman.

Example result for query term obesity, diabetes: Query Example

Authors:

Armand Asnani Paige Calisi