Repo for CS6307 Project 2 - Big Data Management and Analytics
Extract Named Entities from a book using Natural Language Processing (NLP).
Design a search engine for movie plots using tf-idf and cosine similarity.
-Create a cluster with Scala version 2.11 and Spark version 2.4.5; Databricks Runtime 6.4 Extended Support is preferred.
-Install the spark-nlp package (version 2.7.5) from JohnSnowLabs using Maven as the library source (you can enter the coordinates JohnSnowLabs:spark-nlp:2.7.5 directly).
-Part 1 and Part 2 of the project are included in the same notebook. Some dependencies will be downloaded and installed on the first run. Running both parts together takes around 15 minutes.
-For Part 1, use sherlock_holmes.txt as the book; for Part 2, use user_terms.txt as the user queries.
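
To illustrate the tf-idf and cosine-similarity ranking behind the movie-plot search, here is a minimal sketch in plain Python. It is not the notebook's Spark code. The function names, the whitespace tokenizer, the unsmoothed log(N/df) weighting, and the three sample plots are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the notebook may tokenize differently.
    return text.lower().split()


def build_index(docs):
    """Build tf-idf vectors for a dict of doc_id -> plot text."""
    n_docs = len(docs)
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tf in term_freqs.values():
        doc_freq.update(tf.keys())

    # Plain idf = log(N / df); terms that occur in every document get weight 0.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    vectors = {
        doc_id: {term: freq * idf[term] for term, freq in tf.items()}
        for doc_id, tf in term_freqs.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors, idf, top_k=3):
    """Return up to top_k (doc_id, score) pairs with a positive score, best first."""
    query_tf = Counter(tokenize(query))
    # Query terms that never appear in the corpus are ignored.
    query_vec = {t: f * idf[t] for t, f in query_tf.items() if t in idf}
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in scored[:top_k] if score > 0.0]


# Hypothetical sample data, not taken from the project's dataset.
plots = {
    "m1": "a detective investigates a murder in london",
    "m2": "a young wizard attends a school of magic",
    "m3": "a detective and a wizard solve a magic crime",
}
vectors, idf = build_index(plots)
results = search("detective murder", vectors, idf)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

With this sample data, the plot that contains both query terms ranks above the plot that contains only one, and the plot that shares no terms with the query is dropped. On the full movie-plot corpus the same idea would normally be expressed with Spark transformations rather than in-memory dictionaries.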