The goal of this repository is to act as a collection of textual data sets to be used for training and practice in text mining/NLP in R. This repository is not a guide on how to do text analysis/mining, but rather shows how to get a data set to get started with minimal hassle.
First we have the janeaustenr package, popularized by Julia Silge in tidytextmining.
#install.packages("janeaustenr")
library(janeaustenr)
janeaustenr includes 6 books: emma, mansfieldpark, northangerabbey, persuasion, prideprejudice and sensesensibility, all formatted as character vectors with elements of about 70 characters.
head(emma, n = 15)
#> [1] "EMMA"
#> [2] ""
#> [3] "By Jane Austen"
#> [4] ""
#> [5] ""
#> [6] ""
#> [7] ""
#> [8] "VOLUME I"
#> [9] ""
#> [10] ""
#> [11] ""
#> [12] "CHAPTER I"
#> [13] ""
#> [14] ""
#> [15] "Emma Woodhouse, handsome, clever, and rich, with a comfortable home"
All the books can also be found combined into one data frame in the function austen_books().
dplyr::glimpse(austen_books())
#> Observations: 73,422
#> Variables: 2
#> $ text <chr> "SENSE AND SENSIBILITY", "", "by Jane Austen", "", "(1811...
#> $ book <fct> Sense & Sensibility, Sense & Sensibility, Sense & Sensibi...
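Because austen_books() returns a tidy data frame, it plugs straight into a tidytext-style workflow. A minimal sketch (assuming the dplyr and tidytext packages are installed):

```r
library(janeaustenr)
library(dplyr)
library(tidytext)

# Tokenize Emma into one word per row, drop stop words, and count words
emma_words <- austen_books() %>%
  filter(book == "Emma") %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

head(emma_words)
```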
Examples:
The gutenbergr package allows for search and download of public domain texts from Project Gutenberg, which currently includes more than 57,000 free eBooks.
#install.packages("gutenbergr")
library(gutenbergr)
To use gutenbergr you must know the Gutenberg id of the work you wish to analyze. A text search of the works can be done using the gutenberg_works() function.
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#> gutenberg_id title author gutenberg_autho… language gutenberg_books…
#> <int> <chr> <chr> <int> <chr> <chr>
#> 1 768 Wuth… Bront… 405 en Gothic Fiction/…
#> # ... with 2 more variables: rights <chr>, has_text <lgl>
With that id you can use the gutenberg_download() function to download the text.
gutenberg_download(768)
#> Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
#> Using mirror http://aleph.gutenberg.org
#> # A tibble: 12,085 x 2
#> gutenberg_id text
#> <int> <chr>
#> 1 768 WUTHERING HEIGHTS
#> 2 768 ""
#> 3 768 ""
#> 4 768 CHAPTER I
#> 5 768 ""
#> 6 768 ""
#> 7 768 1801.--I have just returned from a visit to my landlord--…
#> 8 768 neighbour that I shall be troubled with. This is certain…
#> 9 768 country! In all England, I do not believe that I could h…
#> 10 768 situation so completely removed from the stir of society.…
#> # ... with 12,075 more rows
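The search and download steps can also be chained into one pipeline; a sketch (assumes dplyr is installed, and the final download requires network access):

```r
library(gutenbergr)
library(dplyr)

# Look up the id in the bundled metadata, then fetch the full text
wuthering_heights <- gutenberg_works(title == "Wuthering Heights") %>%
  pull(gutenberg_id) %>%
  gutenberg_download()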
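The search and download steps can also be chained into one pipeline; a sketch (assumes dplyr is installed, and the final download requires network access):

```r
library(gutenbergr)
library(dplyr)

# Look up the id in the bundled metadata, then fetch the full text
wuthering_heights <- gutenberg_works(title == "Wuthering Heights") %>%
  pull(gutenberg_id) %>%
  gutenberg_download()
```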
Examples:
Still pending.
While the text2vec package isn't a data package by itself, it does include a textual data set inside.
#install.packages("text2vec")
library(text2vec)
The data frame movie_review contains 5,000 IMDB movie reviews selected for sentiment analysis. The reviews have been preprocessed to include a binary sentiment label: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1.
dplyr::glimpse(movie_review)
#> Observations: 5,000
#> Variables: 3
#> $ id <chr> "5814_8", "2381_9", "7759_3", "3630_4", "9495_8", "8...
#> $ sentiment <int> 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0...
#> $ review <chr> "With all this stuff going down at the moment with M...
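As a sketch of how text2vec itself would consume this data, following the pattern from the package vignette, the reviews can be turned into a document-term matrix:

```r
library(text2vec)

# Tokenize the reviews and build a document-term matrix
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = movie_review$id,
             progressbar = FALSE)
vocab <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

dim(dtm)
```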
The sacred package includes 9 tidy data sets: apocrypha, book_of_mormon, doctrine_and_covenants, greek_new_testament, king_james_version, pearl_of_great_price, tanach, vulgate and septuagint, each with columns describing the position within the work.
#devtools::install_github("JohnCoene/sacred")
library(sacred)
dplyr::glimpse(apocrypha)
#> Observations: 5,725
#> Variables: 5
#> $ book.num <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ book <chr> "es1", "es1", "es1", "es1", "es1", "es1", "es1", "es1...
#> $ psalm <chr> "11", "11", "11", "11", "11", "11", "11", "11", "11",...
#> $ verse <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "1...
#> $ text <chr> "And Josias held the feast of the passover in Jerusal...
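The positional columns make per-book summaries straightforward; a quick sketch (assumes dplyr is installed):

```r
library(sacred)
library(dplyr)

# Number of verse rows in each book of the apocrypha
apocrypha %>%
  count(book, sort = TRUE)
```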
Examples:
Still pending.
The hcandersenr package includes many of H.C. Andersen's fairy tales in 5 different languages.
#devtools::install_github("EmilHvitfeldt/hcandersenr")
library(hcandersenr)
The fairy tales are found in the following data frames: hcandersen_en, hcandersen_da, hcandersen_de, hcandersen_es and hcandersen_fr for the English, Danish, German, Spanish and French versions respectively. Please be advised that not all fairy tales are available in all languages in this package.
dplyr::glimpse(hcandersen_en)
#> Observations: 27,859
#> Variables: 2
#> $ text <chr> "A soldier came marching along the high road: \"Left, rig...
#> $ book <chr> "The tinder-box", "The tinder-box", "The tinder-box", "Th...
All the fairy tales are collected in the following data.frame:
dplyr::glimpse(hca_fairytales)
#> Observations: 115,247
#> Variables: 3
#> $ text <chr> "Der kom en soldat marcherende hen ad landevejen: én,...
#> $ book <chr> "The tinder-box", "The tinder-box", "The tinder-box",...
#> $ language <chr> "Danish", "Danish", "Danish", "Danish", "Danish", "Da...
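Since not every tale is available in every language, a quick sketch to check the coverage per language (assumes dplyr is installed):

```r
library(hcandersenr)
library(dplyr)

# Count how many distinct fairy tales each language includes
hca_fairytales %>%
  distinct(book, language) %>%
  count(language)
```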
Examples:
Still pending.
The harrypotter package includes the text from all 7 books in the main series.
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
The 7 books: philosophers_stone, chamber_of_secrets, prisoner_of_azkaban, goblet_of_fire, order_of_the_phoenix, half_blood_prince and deathly_hallows are formatted as character vectors with one chapter per string.
dplyr::glimpse(harrypotter::chamber_of_secrets)
#> chr [1:19] "THE WORST BIRTHDAY Not for the first time, an argument had broken out over breakfast at number four, Privet "| __truncated__ ...
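Because each element is a whole chapter, it can help to reshape a book into a tidy data frame before analysis; a sketch (assumes the tibble package is installed):

```r
library(harrypotter)
library(tibble)

# One row per chapter, with a chapter index alongside the text
chamber_df <- tibble(chapter = seq_along(chamber_of_secrets),
                     text = chamber_of_secrets)

chamber_df
```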
Examples:
- Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R
- Harry Plotter: Part 2 – Hogwarts Houses and their Stereotypes
The subtools package doesn’t include any textual data, but allows you to read subtitle files.
#devtools::install_github("fkeck/subtools")
library(subtools)
The use of this package is shown in the examples.
Examples:
- Movies and series subtitles in R with subtools
- A tidy text analysis of Rick and Morty
- You beautiful, naïve, sophisticated newborn series
The goal of rperseus is to furnish classicists, textual critics, and R enthusiasts with texts from the Classical World. While the English translations of most texts are available through gutenbergr, rperseus returns these works in their original language: Greek, Latin, and Hebrew.
#devtools::install_github("ropensci/rperseus")
library(rperseus)
library(dplyr)

aeneid_latin <- perseus_catalog %>%
filter(group_name == "Virgil",
label == "Aeneid",
language == "lat") %>%
pull(urn) %>%
get_perseus_text()
head(aeneid_latin)
#> # A tibble: 6 x 7
#> text urn group_name label description language section
#> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 Arma virumque cano,… urn:… Virgil Aene… "Perseus:b… lat 1
#> 2 Conticuere omnes, i… urn:… Virgil Aene… "Perseus:b… lat 2
#> 3 Postquam res Asiae … urn:… Virgil Aene… "Perseus:b… lat 3
#> 4 At regina gravi iam… urn:… Virgil Aene… "Perseus:b… lat 4
#> 5 Interea medium Aene… urn:… Virgil Aene… "Perseus:b… lat 5
#> 6 Sic fatur lacrimans… urn:… Virgil Aene… "Perseus:b… lat 6
See the vignette for more examples.
This section includes public data sets and how to import them into R ready for analysis. It is generally advised to save the resulting data so that you don't re-download the data excessively.
This website includes a handful of different movie review data sets. Below are the code chunks necessary to load in the data sets.
library(tidyverse)
library(fs)
filepath <- file_temp() %>%
path_ext_set("tar.gz")
download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz", filepath)
file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]
untar(filepath, files = file_names)
data <- map_df(file_names,
~ tibble(text = read_lines(.x),
polarity = str_detect(.x, "pos"),
cv_tag = str_extract(.x, "(?<=cv)\\d{3}"),
html_tag = str_extract(.x, "(?<=cv\\d{3}_)\\d*")))
glimpse(data)
#> Observations: 64,720
#> Variables: 4
#> $ text <chr> "plot : two teen couples go to a church party , drink...
#> $ polarity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
#> $ cv_tag <chr> "000", "000", "000", "000", "000", "000", "000", "000...
#> $ html_tag <chr> "29416", "29416", "29416", "29416", "29416", "29416",...
library(tidyverse)
library(fs)
filepath <- file_temp() %>%
path_ext_set("tar.gz")
download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz", filepath)
file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]
untar(filepath, files = file_names)
data <- map_df(file_names,
~ tibble(text = read_lines(.x),
polarity = str_detect(.x, "pos")))
glimpse(data)
#> Observations: 10,662
#> Variables: 2
#> $ text <chr> "simplistic , silly and tedious . ", "it's so laddish...
#> $ polarity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
library(tidyverse)
library(fs)
filepath <- file_temp() %>%
path_ext_set("tar.gz")
download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz", filepath)
file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]
untar(filepath, files = file_names)
subjs <- str_subset(file_names, "subj")
ids <- str_subset(file_names, "id")
ratings <- str_subset(file_names, "rating")
names <- str_extract(ratings, "(?<=rating.).*") %>%
str_replace("\\+", " ")
data <- map_df(seq_len(length(names)),
~ tibble(text = read_lines(subjs[.x]),
id = read_lines(ids[.x]),
rating = read_lines(ratings[.x]),
name = names[.x]))
glimpse(data)
#> Observations: 5,006
#> Variables: 4
#> $ text <chr> "in my opinion , a movie reviewer's most important task...
#> $ id <chr> "29420", "17219", "18406", "18648", "20021", "20454", "...
#> $ rating <chr> "0.1", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2",...
#> $ name <chr> "Dennis Schwartz", "Dennis Schwartz", "Dennis Schwartz"...
library(tidyverse)
library(fs)
filepath <- file_temp() %>%
path_ext_set("tar.gz")
download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz", filepath)
file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]
untar(filepath, files = file_names)
data <- map_df(file_names,
~ tibble(text = read_lines(.x),
label = if_else(str_detect(.x, "quote"),
"subjective",
"objective")))
glimpse(data)
#> Observations: 10,000
#> Variables: 2
#> $ text <chr> "smart and alert , thirteen conversations about one thin...
#> $ label <chr> "subjective", "subjective", "subjective", "subjective", ...
The following GitHub repository, BobAdamsEE/SouthParkData, includes the scripts of the first 19 seasons of South Park. The following code snippet lets you download them all at once.
library(tidyverse)

url_base <- "https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-season"
urls <- paste0(url_base, "/Season-", 1:19, ".csv")
data <- map_df(urls, ~ read_csv(.x))
Examples: