


tfidf_text_analysis

Analyzing text with tf-idf statistics to determine how much weight each word carries in a document within a collection of documents.

Programming language: R

This exercise downloads five popular children's books, written by English authors in the 19th century, from Project Gutenberg and analyzes their texts using the term frequency-inverse document frequency statistic (tf-idf). Tf-idf weights a word more heavily when it occurs often in one book but rarely in the others, so the highest-scoring words are the ones most significant to each book's contents. The technique is commonly used in natural language processing (NLP). All the work for the exercise is in the R Notebook tfidf_notebook.Rmd. The R script tfidf_text_anlysis.R contains only the code, without any explanations. I have also included a PDF file with the visuals that were created.
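
To make the weighting concrete, here is a minimal base R sketch of the tf-idf calculation. It is not the code from the notebook. The three one-line "books" are made-up stand-ins for the downloaded texts, and the names `docs`, `tokens`, and `scores` are hypothetical. It assumes the common definitions tf = (count of term in document) / (total terms in document) and idf = ln(number of documents / number of documents containing the term).

```r
# Toy corpus standing in for the downloaded books (made-up text, not from the repo).
docs <- c(
  book_a = "the rabbit ran down the hole",
  book_b = "the pirate hid the treasure",
  book_c = "the jungle hid the tiger"
)

# Tokenize: lowercase each document and split on anything that is not a letter.
tokens <- lapply(docs, function(d) {
  words <- strsplit(tolower(d), "[^a-z]+")[[1]]
  words[nzchar(words)]
})

# Inverse document frequency: ln(total documents / documents containing the term).
vocab <- unique(unlist(tokens, use.names = FALSE))
doc_freq <- sapply(vocab, function(term) {
  sum(sapply(tokens, function(w) term %in% w))
})
idf <- log(length(tokens) / doc_freq)

# Term frequency per document, combined with idf into one row per (document, term).
scores <- do.call(rbind, lapply(names(tokens), function(doc) {
  counts <- table(tokens[[doc]])
  terms <- names(counts)
  tf <- as.numeric(counts) / length(tokens[[doc]])
  data.frame(
    document = doc,
    term = terms,
    tf = tf,
    idf = unname(idf[terms]),
    tf_idf = tf * unname(idf[terms]),
    stringsAsFactors = FALSE
  )
}))

# Highest-weighted terms first within each document.
scores <- scores[order(scores$document, -scores$tf_idf), ]
print(scores, row.names = FALSE)
```

In this toy corpus "the" occurs in all three documents, so its idf is ln(3/3) = 0 and it gets zero weight however often it appears. Words that occur in only one document, such as "rabbit", receive the largest idf.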