This is the repository for R scripts and data that will be used in a Special Topics course taught in the Department of Statistics at Carnegie Mellon University (Fall 2019).
The code posted here is meant to be a kind of starter kit. It should help catalyze ideas for your own scripts -- scripts appropriate to the specific projects you choose to carry out over the course of the term.
We will walk through some of the scripts in class, considering the challenges of NLP generally and NLP in R specifically.
The repository will be added to over the course of the term. So be mindful of the updates.
For this class, we will use an R package called quanteda for most of our text processing. Other popular packages include koRpus and tm.
We are using quanteda because it handles metadata relatively well and doesn't rely on Java to do its tokenizing. It is also well documented.
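As a rough illustration of the metadata handling and tokenizing mentioned above, here is a minimal sketch. It assumes quanteda is already installed; the two toy texts and the `author` field are invented for the example and do not come from the course data.

```r
# A minimal sketch, assuming quanteda is installed.
library(quanteda)

# Two toy documents (invented for illustration).
texts <- c(doc1 = "The cat sat on the mat.",
           doc2 = "The dog chased the cat.")

# Build a corpus and attach document-level metadata (docvars).
# The "author" field is hypothetical.
corp <- corpus(texts, docvars = data.frame(author = c("A", "B")))

# Tokenize the corpus, dropping punctuation.
toks <- tokens(corp, remove_punct = TRUE)

# Build a document-feature matrix; features are lowercased by default.
dfmat <- dfm(toks)

# The metadata travels with the matrix.
docvars(dfmat)

# The most frequent features across both documents.
topfeatures(dfmat, 3)
```

The point of the sketch is that the document variables set on the corpus are still available after tokenizing and building the matrix, which is convenient when you later want to group or compare texts by those variables.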
However, it is a bit flabby. This shouldn't present much of a problem, but you might want to be mindful about when you install and update it -- the middle of a class activity is probably not the best time.
Also, you have probably already installed the tidyverse package. If not, do so.
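One way to get set up before class is sketched below. It assumes you are installing from CRAN with R's default settings; the variable names are just for illustration.

```r
# Packages mentioned in this README.
pkgs <- c("quanteda", "tidyverse")

# Work out which of them are not yet installed.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install only what is missing.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Running this ahead of time, rather than during an activity, leaves room for the installation to finish.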