CS6307Project2

Repo for CS6307 Project 2 - Big Data Management and Analytics

Part 1

Extract Named Entities from a book using Natural Language Processing (NLP).

Part 2

Design a search engine for movie plots using TF-IDF and cosine similarity.
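As a rough illustration of the idea behind Part 2, the sketch below ranks a few plot summaries against a query using TF-IDF weights and cosine similarity. It is plain Scala with no Spark dependency and is not the notebook's implementation: the object name `TfIdfSketch`, its helper functions, the tokenization rule, the `ln(N / df)` weighting, and the sample plots are all assumptions made for this example.

```scala
// Minimal, Spark-free sketch of TF-IDF ranking with cosine similarity.
// Every name below is hypothetical and chosen only for illustration.
object TfIdfSketch {
  type Vec = Map[String, Double]

  // Lowercase the text and split on anything that is not a letter.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Raw term counts for one document.
  def termFreq(tokens: Seq[String]): Vec =
    tokens.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }

  // idf(t) = ln(N / df(t)), where df(t) is the number of documents containing t.
  def idf(docs: Seq[Seq[String]]): Vec = {
    val n = docs.size.toDouble
    docs.flatMap(_.distinct).groupBy(identity).map {
      case (t, occ) => t -> math.log(n / occ.size)
    }
  }

  // TF-IDF vector; terms never seen in the corpus get weight 0.
  def tfIdf(tokens: Seq[String], idfWeights: Vec): Vec =
    termFreq(tokens).map { case (t, f) => t -> f * idfWeights.getOrElse(t, 0.0) }

  // Cosine similarity between two sparse vectors; 0.0 if either has zero norm.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.keySet.intersect(b.keySet).toSeq.map(t => a(t) * b(t)).sum
    val normA = math.sqrt(a.values.map(v => v * v).sum)
    val normB = math.sqrt(b.values.map(v => v * v).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Returns (plot index, score) pairs, best match first.
  def rank(query: String, plots: Seq[String]): Seq[(Int, Double)] = {
    val docs = plots.map(tokenize)
    val weights = idf(docs)
    val q = tfIdf(tokenize(query), weights)
    docs.zipWithIndex
      .map { case (d, i) => i -> cosine(q, tfIdf(d, weights)) }
      .sortBy(-_._2)
  }
}

// Made-up sample data for the usage example.
val samplePlots = Seq(
  "A detective solves a murder in London",
  "A robot learns to love",
  "Space pirates hunt for treasure"
)
val ranked = TfIdfSketch.rank("detective murder", samplePlots)
```

With this sample data, only the first plot shares terms with the query, so it is ranked first and the other two score zero. On a real corpus the same computation would normally be expressed with Spark DataFrames or RDDs rather than in-memory collections.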

Scala code

Instructions

- Create a cluster with Scala 2.11 and Spark 2.4.5; Databricks Runtime 6.4 Extended Support is preferred.

- Install the spark-nlp package from JohnSnowLabs, version 2.7.5, using the Maven source (you can enter the coordinates JohnSnowLabs:spark-nlp:2.7.5 directly).

- Part 1 and Part 2 of the project are included in the same notebook. Some dependencies are downloaded and installed on the first run. Running both parts together can take around 15 minutes.

- For Part 1, use sherlock_holmes.txt as the book; for Part 2, use user_terms.txt as the user queries.
