R-text-data

The goal of this repository is to act as a collection of textual data sets to be used for training and practice in text mining/NLP in R. This repository will not be a guide on how to do text analysis/mining, but rather a guide on how to get a data set so you can get started with minimal hassle.

CRAN packages
Github packages
- sacred
- hcandersenr
- harrypotter
- subtools
- rperseus
Wild data
- Cornell data
- SouthParkData

CRAN packages

janeaustenr

First we have the janeaustenr package, popularized by Julia Silge in tidytextmining.

#install.packages("janeaustenr")
library(janeaustenr)

janeaustenr includes 6 books: emma, mansfieldpark, northangerabbey, persuasion, prideprejudice and sensesensibility, each formatted as a character vector with elements of about 70 characters.

head(emma, n = 15)
#>  [1] "EMMA"                                                               
#>  [2] ""                                                                   
#>  [3] "By Jane Austen"                                                     
#>  [4] ""                                                                   
#>  [5] ""                                                                   
#>  [6] ""                                                                   
#>  [7] ""                                                                   
#>  [8] "VOLUME I"                                                           
#>  [9] ""                                                                   
#> [10] ""                                                                   
#> [11] ""                                                                   
#> [12] "CHAPTER I"                                                          
#> [13] ""                                                                   
#> [14] ""                                                                   
#> [15] "Emma Woodhouse, handsome, clever, and rich, with a comfortable home"

All the books can also be found combined into one data.frame, returned by the function austen_books():

dplyr::glimpse(austen_books())
#> Observations: 73,422
#> Variables: 2
#> $ text <chr> "SENSE AND SENSIBILITY", "", "by Jane Austen", "", "(1811...
#> $ book <fct> Sense & Sensibility, Sense & Sensibility, Sense & Sensibi...
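
As a quick check that the combined data frame is ready for analysis, the following sketch counts the lines in each novel. It assumes dplyr is installed; `lines_per_book` and `n_lines` are names chosen here for illustration and are not part of the package.

```r
library(janeaustenr)
library(dplyr)

# Count the number of text lines in each novel.
# `lines_per_book` and `n_lines` are illustrative names.
lines_per_book <- austen_books() %>%
  count(book, name = "n_lines")

lines_per_book
```

The result has one row per book, which is a convenient starting point for per-book summaries.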

Examples:

https://juliasilge.com/blog/if-i-loved-nlp-less/

gutenbergr

The gutenbergr package allows for search and download of public domain texts from Project Gutenberg, which currently includes more than 57,000 free eBooks.

#install.packages("gutenbergr")
library(gutenbergr)

To use gutenbergr you must know the Gutenberg id of the work you wish to analyze. A text search of the works can be done using the gutenberg_works function.

gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#>   gutenberg_id title author gutenberg_autho… language gutenberg_books…
#>          <int> <chr> <chr>             <int> <chr>    <chr>           
#> 1          768 Wuth… Bront…              405 en       Gothic Fiction/…
#> # ... with 2 more variables: rights <chr>, has_text <lgl>
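
The filter passed to gutenberg_works() does not have to be an exact match. A minimal sketch of a partial title search, assuming stringr is installed; `hits` is an illustrative name:

```r
library(gutenbergr)
library(stringr)

# Find every work whose title contains "Wuthering".
# The search runs against the metadata shipped with the package.
hits <- gutenberg_works(str_detect(title, "Wuthering"))

# Show only the columns needed to pick an id.
hits[, c("gutenberg_id", "title", "author")]
```

This is useful when the exact title recorded by Project Gutenberg is not known in advance.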

With that id you can use the gutenberg_download() function to download the text.

gutenberg_download(768)
#> Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
#> Using mirror http://aleph.gutenberg.org
#> # A tibble: 12,085 x 2
#>    gutenberg_id text                                                      
#>           <int> <chr>                                                     
#>  1          768 WUTHERING HEIGHTS                                         
#>  2          768 ""                                                        
#>  3          768 ""                                                        
#>  4          768 CHAPTER I                                                 
#>  5          768 ""                                                        
#>  6          768 ""                                                        
#>  7          768 1801.--I have just returned from a visit to my landlord--…
#>  8          768 neighbour that I shall be troubled with.  This is certain…
#>  9          768 country!  In all England, I do not believe that I could h…
#> 10          768 situation so completely removed from the stir of society.…
#> # ... with 12,075 more rows

Examples:

Still pending.

text2vec

While the text2vec package is not a data package by itself, it does include a textual data set inside.

#install.packages("text2vec")
library(text2vec)

The data frame movie_review contains 5000 IMDB movie reviews selected for sentiment analysis. It has been preprocessed to include a binary sentiment label: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1.

dplyr::glimpse(movie_review)
#> Observations: 5,000
#> Variables: 3
#> $ id        <chr> "5814_8", "2381_9", "7759_3", "3630_4", "9495_8", "8...
#> $ sentiment <int> 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0...
#> $ review    <chr> "With all this stuff going down at the moment with M...
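
Since the sentiment column is a ready-made binary label, the data set can go straight into a classification exercise. The sketch below checks the class balance and makes a reproducible train/test split; the seed, the 80/20 proportion and the names `train_idx`, `train` and `test` are arbitrary choices made for illustration.

```r
library(text2vec)
data("movie_review")

# Class balance of the binary sentiment label (0 = negative, 1 = positive).
table(movie_review$sentiment)

# Reproducible 80/20 train/test split; seed and proportion are arbitrary.
set.seed(42)
train_idx <- sample(nrow(movie_review), size = 0.8 * nrow(movie_review))
train <- movie_review[train_idx, ]
test  <- movie_review[-train_idx, ]
```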

Github packages

sacred

The sacred package includes 9 tidy data sets: apocrypha, book_of_mormon, doctrine_and_covenants, greek_new_testament, king_james_version, pearl_of_great_price, tanach, vulgate and septuagint, with columns describing the position within each work.

#devtools::install_github("JohnCoene/sacred")
library(sacred)

dplyr::glimpse(apocrypha)
#> Observations: 5,725
#> Variables: 5
#> $ book.num <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ book     <chr> "es1", "es1", "es1", "es1", "es1", "es1", "es1", "es1...
#> $ psalm    <chr> "11", "11", "11", "11", "11", "11", "11", "11", "11",...
#> $ verse    <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "1...
#> $ text     <chr> "And Josias held the feast of the passover in Jerusal...

Examples:

Still pending.

hcandersenr

The hcandersenr package includes many of H.C. Andersen’s fairy tales in 5 different languages.

#devtools::install_github("EmilHvitfeldt/hcandersenr")
library(hcandersenr)

The fairy tales are found in the data frames hcandersen_en, hcandersen_da, hcandersen_de, hcandersen_es and hcandersen_fr for the English, Danish, German, Spanish and French versions respectively. Please be advised that not all fairy tales are available in every language in this package.

dplyr::glimpse(hcandersen_en)
#> Observations: 27,859
#> Variables: 2
#> $ text <chr> "A soldier came marching along the high road: \"Left, rig...
#> $ book <chr> "The tinder-box", "The tinder-box", "The tinder-box", "Th...

All the fairy tales are collected in the following data.frame:

dplyr::glimpse(hca_fairytales)
#> Observations: 115,247
#> Variables: 3
#> $ text     <chr> "Der kom en soldat marcherende hen ad landevejen: én,...
#> $ book     <chr> "The tinder-box", "The tinder-box", "The tinder-box",...
#> $ language <chr> "Danish", "Danish", "Danish", "Danish", "Danish", "Da...

Examples:

Still pending.

harrypotter

The harrypotter package includes the text from all 7 main series books.

#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)

The 7 books, philosophers_stone, chamber_of_secrets, prisoner_of_azkaban, goblet_of_fire, order_of_the_phoenix, half_blood_prince and deathly_hallows, are formatted as character vectors with one string per chapter.

dplyr::glimpse(harrypotter::chamber_of_secrets)
#>  chr [1:19] "THE WORST BIRTHDAY  Not for the first time, an argument had broken out over breakfast at number four, Privet "| __truncated__ ...
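
Because each book is a plain character vector, it usually helps to put it in a data frame with an explicit chapter number before analysis. A minimal sketch, assuming tibble is installed; `chapters` and its column names are illustrative:

```r
library(harrypotter)
library(tibble)

# One row per chapter, numbered in reading order.
# `chapters`, `chapter` and `text` are illustrative names.
chapters <- tibble(
  chapter = seq_along(chamber_of_secrets),
  text = chamber_of_secrets
)

chapters
```

The same pattern applies to the other six books.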

Examples:

subtools

The subtools package doesn’t include any textual data, but allows you to read subtitle files.

#devtools::install_github("fkeck/subtools")
library(subtools)

Usage of the package is shown in the examples.

Examples:

rperseus

The goal of rperseus is to furnish classicists, textual critics, and R enthusiasts with texts from the Classical World. While the English translations of most texts are available through gutenbergr, rperseus returns these works in their original language–Greek, Latin, and Hebrew.

#devtools::install_github("ropensci/rperseus")
library(rperseus)
library(dplyr) # provides %>%, filter() and pull()
aeneid_latin <- perseus_catalog %>% 
  filter(group_name == "Virgil",
         label == "Aeneid",
         language == "lat") %>% 
  pull(urn) %>% 
  get_perseus_text()
head(aeneid_latin)
#> # A tibble: 6 x 7
#>   text                 urn   group_name label description language section
#>   <chr>                <chr> <chr>      <chr> <chr>       <chr>      <int>
#> 1 Arma virumque cano,… urn:… Virgil     Aene… "Perseus:b… lat            1
#> 2 Conticuere omnes, i… urn:… Virgil     Aene… "Perseus:b… lat            2
#> 3 Postquam res Asiae … urn:… Virgil     Aene… "Perseus:b… lat            3
#> 4 At regina gravi iam… urn:… Virgil     Aene… "Perseus:b… lat            4
#> 5 Interea medium Aene… urn:… Virgil     Aene… "Perseus:b… lat            5
#> 6 Sic fatur lacrimans… urn:… Virgil     Aene… "Perseus:b… lat            6

See the vignette for more examples.

Wild data

This section includes public data sets and shows how to import them into R ready for analysis. It is generally advised to save the resulting data so that you don’t re-download it excessively.
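
One way to follow that advice is to cache the result on disk and only rebuild it when the cache is missing. The helper below is a hypothetical sketch in base R; `load_cached()`, its arguments and the stand-in builder are assumptions made for illustration and are not part of any package listed here.

```r
# Return the cached object at `cache_path` if it exists;
# otherwise call `build()`, save its result, and return it.
# `load_cached()` is a hypothetical helper for illustration.
load_cached <- function(cache_path, build) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))
  }
  result <- build()
  saveRDS(result, cache_path)
  result
}

# Usage with a stand-in builder; swap in one of the download snippets.
cache_path <- file.path(tempdir(), "example_data.rds")
example_data <- load_cached(cache_path, function() {
  data.frame(text = c("first line", "second line"), stringsAsFactors = FALSE)
})
```

For real use, point `cache_path` at a permanent location instead of `tempdir()`, which is cleared at the end of the R session.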

Movie Review Data

This website (http://www.cs.cornell.edu/people/pabo/movie-review-data/) includes a handful of different movie review data sets. Below are the code chunks necessary to load the data sets.

polarity dataset v2.0

library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names, 
               ~ tibble(text = read_lines(.x),
                        polarity = str_detect(.x, "pos"),
                        cv_tag = str_extract(.x, "(?<=cv)\\d{3}"),
                        html_tag = str_extract(.x, "(?<=cv\\d{3}_)\\d*")))

glimpse(data)
#> Observations: 64,720
#> Variables: 4
#> $ text     <chr> "plot : two teen couples go to a church party , drink...
#> $ polarity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
#> $ cv_tag   <chr> "000", "000", "000", "000", "000", "000", "000", "000...
#> $ html_tag <chr> "29416", "29416", "29416", "29416", "29416", "29416",...
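
Because read_lines() returns one element per line, the data frame above has one row per line of a review rather than one row per review. If whole reviews are needed, the lines can be pasted back together. A minimal sketch on a made-up stand-in with the same columns; `review_lines` and `reviews` are illustrative names and the values are invented:

```r
library(dplyr)

# Toy stand-in with the same columns as the data frame above (made-up values).
review_lines <- tibble(
  text = c("line one", "line two", "another review"),
  polarity = c(FALSE, FALSE, TRUE),
  cv_tag = c("000", "000", "001"),
  html_tag = c("29416", "29416", "19502")
)

# Collapse to one row per review by pasting its lines together.
# polarity is part of the key because cv tags repeat across pos and neg.
reviews <- review_lines %>%
  group_by(cv_tag, html_tag, polarity) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop")

reviews
```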

sentence polarity dataset v1.0

library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names, 
               ~ tibble(text = read_lines(.x),
                        polarity = str_detect(.x, "pos")))

glimpse(data)
#> Observations: 10,662
#> Variables: 2
#> $ text     <chr> "simplistic , silly and tedious . ", "it's so laddish...
#> $ polarity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...

scale dataset v1.0

library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

subjs <- str_subset(file_names, "subj")
ids <- str_subset(file_names, "id")
ratings <- str_subset(file_names, "rating")
names <- str_extract(ratings, "(?<=rating.).*") %>%
  str_replace("\\+", " ")

data <- map_df(seq_along(names), 
               ~ tibble(text = read_lines(subjs[.x]),
                        id = read_lines(ids[.x]),
                        rating = read_lines(ratings[.x]),
                        name = names[.x]))

glimpse(data)
#> Observations: 5,006
#> Variables: 4
#> $ text   <chr> "in my opinion , a movie reviewer's most important task...
#> $ id     <chr> "29420", "17219", "18406", "18648", "20021", "20454", "...
#> $ rating <chr> "0.1", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2",...
#> $ name   <chr> "Dennis Schwartz", "Dennis Schwartz", "Dennis Schwartz"...

subjectivity dataset v1.0

library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names, 
               ~ tibble(text = read_lines(.x),
                        label = if_else(str_detect(.x, "quote"), 
                                        "subjective", 
                                        "objective")))

glimpse(data)
#> Observations: 10,000
#> Variables: 2
#> $ text  <chr> "smart and alert , thirteen conversations about one thin...
#> $ label <chr> "subjective", "subjective", "subjective", "subjective", ...

SouthParkData

The GitHub repository BobAdamsEE/SouthParkData includes the scripts of the first 19 seasons of South Park. The following code snippet lets you download them all at once.

library(tidyverse)

url_base <- "https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-season"
urls <- paste0(url_base, "/Season-", 1:19, ".csv")

data <- map_df(urls, ~ read_csv(.x))

Examples:

https://www.kaylinpavlik.com/text-mining-south-park/

sidmehta1975/R-text-data
