This is an ngram model that predicts the next word in English text based on the preceding (history) words. To understand fully how it works, read the report in main.html. To see a working POC web app, go to https://dmitrytoda.shinyapps.io/SwiftPredict/
All the code comes as separate helper functions, one per file, that are described below.
These functions are needed to build the model from a data set (a sketch of the full pipeline follows the list):

- `createSample()` randomly chooses some lines from an input text file to create a smaller sample for exploratory analysis
- `file2sentences()` reads text from file(s) and returns a `quanteda::corpus` object split into one-sentence documents
- `str2tokens()` splits each (one-sentence) text into tokens (words) with the following preprocessing:
    - remove punctuation
    - remove special symbols
    - remove numbers
    - remove URLs
    - remove tokens that contain no letters
    - remove tokens that contain non-English characters
- `nFreq()` builds an ngram frequency table for a given `quanteda::tokens` object (that is, counts how many times each ngram is observed)
- `removeOOD()` takes an ngram frequency table, replaces out-of-dictionary words with a special unknown-word token and recalculates frequencies by collapsing the ngrams that have become identical
- `keep3()` keeps only the top-3 predictions for each possible history and replaces integer ngram frequencies with factor prediction ranks (1, 2, 3)
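For orientation, here is a minimal sketch of how these functions could be chained into a build pipeline. The file paths, argument names and the `dict` dictionary object are assumptions for illustration; check each function's own file for the actual signatures.

```r
# Hypothetical build pipeline (paths and argument names are assumptions)
createSample("data/en_US.blogs.txt", "data/sample.txt")  # draw a smaller random sample
corp  <- file2sentences("data/sample.txt")  # corpus of one-sentence documents
toks  <- str2tokens(corp)                   # tokenize and clean up
freq3 <- nFreq(toks, n = 3)                 # trigram frequency table
freq3 <- removeOOD(freq3, dict)             # collapse out-of-dictionary words
model <- keep3(freq3)                       # keep top-3 predictions per history
```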
Once you have a prepared model, only two functions are needed to use it:

- `combined_predict()` makes predictions based on some history, an ngram model and a dictionary
- `my_cond()` is a helper function that creates `data.table`-compatible conditions for fast ngram binary search (see the sketch below)
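The point of the binary search is that a keyed `data.table` can look up all predictions for a given history without scanning the whole table. A minimal sketch of the idea, with an assumed column layout (`w1`, `w2`, `pred`, `rank`) and a hypothetical `combined_predict()` call:

```r
library(data.table)

# Toy bigram-history model; the column layout is an assumption
model <- data.table(
  w1   = c("of", "of", "in"),
  w2   = c("the", "the", "the"),
  pred = c("world", "day", "end"),
  rank = factor(c(1, 2, 1))
)
setkey(model, w1, w2)   # keyed columns enable binary search

model[.("of", "the")]   # keyed lookup: all stored predictions for "of the"

# Hypothetical call: predict the word following "one of the"
# combined_predict("one of the", model, dict)
```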
The `Shiny_app` folder contains all the code for a POC Shiny web app that allows you to input any text and get a prediction along with the ngrams that contributed to it:

- `server.R` and `UI.R` contain the server and UI code respectively
- `model20_with_dict` contains the 20,000-word dictionary and the 1- to 6-gram model itself
- other files are just copied from the root folder for deployment to Shiny servers
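For context, here is a minimal sketch of how such an app could wire the model into Shiny. The actual `server.R`/`UI.R` are more elaborate (they also display the contributing ngrams), and the input/output names here are assumptions:

```r
library(shiny)

# Hypothetical minimal wiring; model and dict would be loaded
# from the model20_with_dict folder before the app starts
ui <- fluidPage(
  textInput("history", "Type some text:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    combined_predict(input$history, model, dict)  # hypothetical signature
  })
}

shinyApp(ui, server)
```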
`_shrink model.R` and `collapse_ngrams.R` were used to collapse the initial big 50k-word dictionary model to the current 20k-word one.
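The rough idea behind that collapse, as a hedged sketch (the column names, the count column and the unknown-word token are assumptions): map every word that falls outside the smaller dictionary to the unknown token, then re-aggregate the counts of ngrams that have become identical.

```r
library(data.table)

# Illustrative only: freq is assumed to be a trigram frequency table with
# word columns w1, w2, pred and a count column n; dict20 is the 20k dictionary
for (col in c("w1", "w2", "pred")) {
  freq[!get(col) %in% dict20, (col) := "<unk>"]  # "<unk>" is a placeholder name
}
collapsed <- freq[, .(n = sum(n)), by = .(w1, w2, pred)]
```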