---
title: "Next Word Prediction App"
author: "Amber Wang"
date: "3/31/2018"
output: ioslides_presentation
---
knitr::opts_chunk$set(echo = FALSE)
- This project was built for the Data Science Capstone, the tenth course of the Coursera Data Science Specialization. The course focuses on analyzing a large corpus of text documents to discover the structure in the data and how words are put together, with the goal of building a predictive text model.
- Contents
- Text data analysis: analysis of the corpus to understand the relationship of words and word pairs
- Predictive modeling: build basic n-gram models and develop algorithms to facilitate text prediction
- Shiny app development: produce a web-based Shiny app interface to predict next words
- Getting and cleaning the data: profanity was removed first, then the text was tokenized into words
- Exploratory data analysis: the frequencies of words and word pairs were calculated
- Modeling: 2-gram through 7-gram models were built to facilitate word prediction
- Prediction model:
    - Katz's back-off model was used to predict the next word
    - The model iterates from the 7-gram down to the 2-gram table, matching on the last n-1 words of the input
    - If no n-gram matches, the most frequent word, 'the', is returned
    - To improve efficiency, word pairs that appear fewer than 5 times in the corpus were removed
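
The counting and back-off steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the code behind the app: the toy corpus, the function names (`tokenize`, `count_ngrams`, `predict_next`), and the `context|next word` key format are all hypothetical, the pruning threshold is lowered to 2 so the toy data survives it, and the lookup is a plain longest-match back-off without the discounting and back-off weights of a full Katz model.

```r
# Toy corpus standing in for the real text data (made up for illustration).
corpus <- c("I want to go home", "I want to go out", "they want to see you",
            "we want to see it", "you want to see")

# Lowercase the text and split it into words.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Count all 2-gram to max_n-gram occurrences, keyed as "context|next word",
# and drop entries seen fewer than min_count times.
count_ngrams <- function(texts, max_n = 7, min_count = 1) {
  keys <- character(0)
  for (text in texts) {
    w <- tokenize(text)
    for (n in 2:max_n) {
      if (length(w) < n) break
      for (i in seq_len(length(w) - n + 1)) {
        context <- paste(w[i:(i + n - 2)], collapse = " ")
        keys <- c(keys, paste(context, w[i + n - 1], sep = "|"))
      }
    }
  }
  counts <- table(keys)
  counts <- setNames(as.integer(counts), names(counts))
  counts[counts >= min_count]
}

# Try the longest available context first, then back off to shorter ones.
# Returns up to k next words, or the fallback word if nothing matches.
predict_next <- function(query, counts, max_n = 7, k = 1, fallback = "the") {
  if (length(counts) == 0) return(fallback)
  words <- tokenize(query)
  keys <- strsplit(names(counts), "|", fixed = TRUE)
  contexts <- vapply(keys, `[`, character(1), 1)
  nexts <- vapply(keys, `[`, character(1), 2)
  for (len in rev(seq_len(min(max_n - 1, length(words))))) {
    context <- paste(tail(words, len), collapse = " ")
    match <- contexts == context
    if (any(match)) {
      ranked <- nexts[match][order(counts[match], decreasing = TRUE)]
      return(head(ranked, k))
    }
  }
  fallback
}

counts <- count_ngrams(corpus, min_count = 2)
predict_next("I want to", counts)          # matches the 4-gram "i want to|go"
predict_next("she might want to", counts)  # backs off to the context "want to"
predict_next("zebra", counts)              # no match, returns the fallback
```

Growing `keys` inside a loop is fine for a toy corpus but would be slow on real data, where a dedicated package is the more practical choice.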
- The data analysis and model building writeups can be found on GitHub
- The Shiny app for prediction can be found here
- The app takes in the following inputs:
- query word/phrase that the user inputs
- number of next words to predict
- The predicted next word(s) are listed in order from most to least frequently used
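
A Shiny app with these two inputs could be wired up roughly as follows. This is a sketch only: the input IDs, labels, and the `suggest_words` stand-in are hypothetical, and a real app would call the prediction model where the stand-in returns a fixed word list.

```r
# Stand-in predictor (hypothetical): returns up to n words from a fixed list
# that is already ordered from most to least frequent.
suggest_words <- function(query, n) {
  n <- as.integer(n)
  if (length(n) != 1 || is.na(n) || n < 1) return(character(0))
  candidates <- c("the", "to", "and", "a", "of")
  head(candidates, n)
}

# Build the app only if the shiny package is installed.
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::textInput("query", "Enter a word or phrase"),
    shiny::numericInput("n", "Number of predictions", value = 3, min = 1, max = 5),
    shiny::verbatimTextOutput("prediction")
  )

  server <- function(input, output) {
    # Re-runs whenever either input changes.
    output$prediction <- shiny::renderPrint({
      suggest_words(input$query, input$n)
    })
  }

  app <- shiny::shinyApp(ui, server)
  # In an interactive session, launch it with shiny::runApp(app).
}
```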
- This course is part of the Coursera Data Science Specialization
- The Quanteda package was used for data analysis and n-gram generation
- Read more about Katz's back-off model