
Term frequency analysis of Drug Review Dataset with Apache Spark



Term frequency analysis of the Drug Review Dataset with Apache Spark, carried out as part of the Big Data Parallel Programming course at Halmstad University.
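For readers new to the idea, term frequency analysis boils down to tokenizing each review and counting how often every term occurs. The snippet below is a minimal plain-Python sketch of that idea, not the notebook's actual Spark code: the function names `tokenize` and `term_frequencies`, the regex-based tokenizer, and the sample reviews are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(reviews):
    """Count how often each term occurs across all reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review))
    return counts


# Made-up example reviews (hypothetical, not rows from the dataset).
sample_reviews = [
    "No side effects so far",
    "Mild side effects, but it works",
]

counts = term_frequencies(sample_reviews)
print(counts.most_common(3))
```

In Spark, the same shape is usually expressed on an RDD with `flatMap` for tokenizing and `reduceByKey` for summing the counts, so the work is spread over partitions instead of a single loop.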

NOTE: There is a weird bug when running the lemmatizing cell; you have to execute this cell three times.

Read my paper about this project here.


Dataset acquired from the UCI Machine Learning Repository


Surya Kallumadi

Kansas State University

Manhattan, Kansas, USA

surya '@' ksu.edu

Felix Gräßer

Institut für Biomedizinische Technik

Technische Universität Dresden

Dresden, Germany

felix.graesser '@' tu-dresden.de

Relevant paper

Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125. DOI:Web Link


These libraries are available via pip:

  • wordcloud
  • nltk
  • numpy
  • pandas
  • matplotlib
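
Before opening the notebook, it can help to check which of the libraries listed above are already installed. The following is a small sketch under stated assumptions: the helper name `missing_packages` is hypothetical, and the check relies on the pip name and the import name being identical, which holds for these five packages.

```python
from importlib.util import find_spec

# Package names copied from the list above; for these five the pip name
# and the import name are the same.
REQUIRED = ["wordcloud", "nltk", "numpy", "pandas", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        # Suggest a single pip command covering only what is missing.
        print("Missing, install with: pip install " + " ".join(missing))
    else:
        print("All listed libraries are importable.")
```

The check only looks the packages up and does not import them, so it is quick to run and has no side effects.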