Amazon Appstore for Android opened on 3/22/2011 and was made available in nearly 200 countries. Developers are paid 70% of the list price of the app or in-app purchase.

The goal of this project

To help developers identify customer needs so they can better direct quality assurance, add or remove functionality, and fix bugs promptly in order to retain and grow their customer base.

What is done in this project?

The dataset is preprocessed into a lemmatized corpus, and topics are modeled using LDA or other methods, with the models tuned on coherence scores. Each topic is assigned a human-interpretable label that conveys useful information to developers. The topics can be interpreted using their keywords, the keywords' prevalence or weights, the reviews most representative of each topic, and word clouds.
BERT is trained on these topic labels and compared with the LDA topic models to see how their classifications differ. The comparison is based on 2D plots of projected representative embeddings and on the topic distributions of reviews that the two models classify differently.
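As a rough illustration of the comparison step, the sketch below finds reviews whose dominant LDA topic differs from the label predicted by a second classifier such as BERT. The function names (`dominant_topic`, `find_disagreements`) and the toy data are hypothetical and are not taken from the project's notebooks; the only assumption is that each review has an LDA topic distribution and an integer label from the other model.

```python
def dominant_topic(distribution):
    """Return the index of the topic with the highest probability."""
    return max(range(len(distribution)), key=lambda i: distribution[i])


def find_disagreements(lda_distributions, bert_labels):
    """Collect reviews where the dominant LDA topic differs from the BERT label.

    Each result is (review_index, lda_topic, bert_topic, lda_distribution),
    so the full topic distribution can be inspected for the mismatched review.
    """
    mismatches = []
    for idx, (dist, bert_topic) in enumerate(zip(lda_distributions, bert_labels)):
        lda_topic = dominant_topic(dist)
        if lda_topic != bert_topic:
            mismatches.append((idx, lda_topic, bert_topic, dist))
    return mismatches


# Toy data (hypothetical): three reviews, three topics.
lda_distributions = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.1, 0.8],
]
bert_labels = [0, 0, 2]

mismatches = find_disagreements(lda_distributions, bert_labels)
for idx, lda_topic, bert_topic, dist in mismatches:
    print(f"review {idx}: LDA topic {lda_topic}, BERT topic {bert_topic}, distribution {dist}")
```

Reviews with a nearly flat distribution, like the second one here, are the ones most likely to be classified differently by the two models.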


The data comes from the "Amazon Customer Reviews Dataset", which is publicly available in an S3 bucket in the AWS US East Region. This project uses a subset of the "amazon_reviews_us_Mobile_Apps_v1_00.tsv" file, which contains information about each review of different apps. The app chosen is "Netflix", which has one of the largest numbers of reviews between 2010-11-04 and 2015-08-31. A shuffled half of the reviews is used for the project, and the other half is retained as a hold-out set for future use. In total, 12,566 reviews are used for topic modeling in this project.
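The selection and split described above could look roughly like the following sketch. It uses only the standard library and a tiny in-memory TSV instead of the real file. The column names `product_title` and `review_body`, the helper names, and the seed are assumptions made for illustration, not the project's actual code.

```python
import csv
import io
import random


def load_app_reviews(tsv_file, app_title):
    """Read a tab-separated review file and keep only rows for one app."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row["product_title"] == app_title]


def split_half(rows, seed=42):
    """Shuffle a copy of the rows and split it into two halves."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Tiny stand-in for the real TSV file (hypothetical rows).
sample_tsv = io.StringIO(
    "product_title\treview_body\n"
    "Netflix\tKeeps crashing on my tablet\n"
    "Other App\tWorks fine\n"
    "Netflix\tGreat selection of shows\n"
    "Netflix\tCannot log in after the update\n"
    "Netflix\tStreaming is smooth\n"
)

netflix_rows = load_app_reviews(sample_tsv, "Netflix")
working_set, holdout_set = split_half(netflix_rows)
print(len(netflix_rows), len(working_set), len(holdout_set))
```

Fixing the seed keeps the hold-out set stable across runs, so the same unseen reviews can be reused later for classification.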


gensim, spacy, NLTK for preprocessing
gensim, pyLDAvis for LDA, NMF, LSA
ktrain, transformers, bert_embedding for BERT
PIL, wordcloud, sklearn, umap, gensim for visualization
JupyterLab for modeling
PyCharm for Flask


  1. grid_search.ipynb
    Contains the coherence scores in each combination of 'Number of Topics' and $\alpha$ in heatmaps to help select the best model.
  2. LDA_netflix.ipynb
    Interprets each LDA model with the tuned hyperparameters.
  3. NMF_LSA_topic_modeling.ipynb
    Interprets NMF and LSA models with the tuned hyperparameters.
  4. LDA_topic_labelling.ipynb
    Labels each topic in an interpretable way.
  5. LDA_classification.ipynb
    Predicts topics for unseen data (hold-out data)
  6. BERT.ipynb
    Predicts topics with unseen data using BERT
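The grid search in grid_search.ipynb can be pictured with a sketch like the one below: every combination of number of topics and $\alpha$ is scored, the scores are arranged as a matrix suitable for a heatmap, and the best pair is selected. The function `grid_search` and the toy scoring function are hypothetical. `lda_coherence` shows one possible way to compute a coherence score with gensim's `LdaModel` and `CoherenceModel`; it is an assumption about the setup, and the notebooks may instead use the Mallet-based model or different settings.

```python
from itertools import product


def grid_search(score_fn, topic_counts, alphas):
    """Score every (num_topics, alpha) pair and return the best pair and all scores."""
    scores = {(k, a): score_fn(k, a) for k, a in product(topic_counts, alphas)}
    best = max(scores, key=scores.get)
    return best, scores


def lda_coherence(texts, num_topics, alpha):
    """One possible scoring function (assumption): c_v coherence of a gensim LDA model.

    `texts` is a list of token lists, for example the lemmatized reviews.
    gensim is imported lazily so the rest of the sketch runs without it.
    """
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   alpha=alpha, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v")
    return coherence.get_coherence()


# Toy scoring function standing in for a real coherence computation.
def toy_score(num_topics, alpha):
    return -((num_topics - 10) ** 2) - abs(alpha - 0.1)


topic_counts = [5, 10, 15]
alphas = [0.01, 0.1, 1.0]
best, scores = grid_search(toy_score, topic_counts, alphas)

# Rows are topic counts and columns are alpha values, ready for a heatmap.
heatmap_rows = [[scores[(k, a)] for a in alphas] for k in topic_counts]
print("best (num_topics, alpha):", best)
```

With real data, `toy_score` would be replaced by something like `lambda k, a: lda_coherence(texts, k, a)`, and `heatmap_rows` could be passed to a plotting library to produce the heatmap.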


  1. Datasets
    raw_data, preprocessed_data
  2. Predictions
  3. pyLDAvis Visualization
    mallet_lda_vis, std_lda_vis
  4. Models
    bert_model, lda_mallet_model
  5. Coherence scores from grid search
  6. Images

Python scripts

  1. utils.py
    Includes the NLPpipe class for pipelining and helper functions for interpretation
  2. predictor_api.py, predictor_app.py, templates/predictor.py
    Files for the Flask app. Credited to link



