Slidedeck

title: "Next Word Prediction App" author: "Amber Wang" date: "3/31/2018" output: ioslides_presentation

knitr::opts_chunk$set(echo = FALSE)

This is the tenth course of the Coursera Data Science Specialization, Data Science Capstone. This course focuses on analyzing a large corpus of text documents to discover the structure in the data and how words are put together to build a predictive text model.
Contents
- Text data analysis: analysis of the corpus to understand the relationship of words and word pairs
- Predictive modeling: build basice n-gram models and develop algorithms to facilitate text prediction
- Shiny app development: produce a web-based Shiny app interphase to predict next words

Getting and cleaning the data: profanity was first removed and words tokenized
Exploratory data analysis: the frequencies of words and word paris were calculated
Modeling: 2-7 gram models were built to facilitate word prediction
Prediciton model: - Katz's back-off model was used to predict the next word - The model iterates from 7-gram to 2-gram to find matches in the last n-1 words - In the case of unseen n-gram, the most frequent word, 'the', is returned - To improve efficiency, word pairs that appear less than 5 times in the corpus were removed

The data analysis and model building writeups can be found on GitHub
The Shiny app for prediction can be found here
The app takes in the following inputs:
1. query word/phrase that the user inputs
2. number of predicted next word
The predicted next word(s) will show up in the order of most frequently used to less frequently used