This repository contains the R scripts used to scrape transcripts of The Simpsons, as well as the scraped data files and the assembled output data file.
The source of the data is the Forever Dreaming Transcripts website.
The source transcript (HTML) files for The Simpsons are available at:
https://transcripts.foreverdreaming.org/viewforum.php?f=431&start=725
Scraping-The-Simpsons-Transcripts-with-R.pdf
The ultimate output of the R scripts is the file `simpsons-transcripts.txt`.
This is a field-separated file in which the field separator is the caret symbol (`^`). The content of `simpsons-transcripts.txt` is essentially a data table with five columns:

- `year`: year in which the first episode of that season aired
- `season`: season number
- `episode`: episode number
- `title`: episode title
- `text`: transcript text

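
As a rough illustration of the caret-separated layout described above, the sketch below writes a tiny made-up sample to a temporary file and reads it back with base R. The helper name `read_transcripts`, the sample rows, and the presence of a header line are all assumptions for the sake of the example; if the real file has no header row, pass `header = FALSE` and set the column names yourself.

```r
# Hypothetical two-row sample in the same caret-separated layout.
# The titles and text are made-up placeholders, not real transcript data.
sample_lines <- c(
  "year^season^episode^title^text",
  "1989^1^1^Placeholder Title One^First line of a transcript, with a comma.",
  "1989^1^2^Placeholder Title Two^It's a line with an apostrophe."
)
sample_file <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_file)

# Hypothetical helper: read a caret-separated transcripts file.
# quote = "" and comment.char = "" keep apostrophes, quotation marks,
# and '#' characters inside the transcript text from confusing the parser.
read_transcripts <- function(path, header = TRUE) {
  read.table(
    path,
    sep = "^",
    header = header,
    quote = "",
    comment.char = "",
    stringsAsFactors = FALSE
  )
}

transcripts <- read_transcripts(sample_file)
str(transcripts)
```

To try this on the real data, replace `sample_file` with the path to `simpsons-transcripts.txt` in your local copy of the repository.
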
This data set can be used for text mining purposes.
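
As one possible starting point for text mining, here is a minimal word-frequency sketch in base R. The function `count_words` and the toy input are hypothetical and not part of the repository; in practice you would pass the `text` column of the transcripts table instead.

```r
# Hypothetical helper: count word frequencies in a character vector.
count_words <- function(text) {
  # Lowercase, then split on anything that is not a letter or apostrophe.
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  # Drop empty strings produced by leading separators.
  words <- words[nzchar(words)]
  sort(table(words), decreasing = TRUE)
}

# Toy input standing in for transcript text (not real data).
toy_text <- c("The donut is the best donut", "Best. Donut. Ever.")
freq <- count_words(toy_text)
head(freq, 3)
```

This simple tokenizer only handles unaccented ASCII letters; a dedicated text mining package would be a better fit for anything beyond a quick look.
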
README.md
Scraping-The-Simpsons-Transcripts-with-R.pdf
code/
script1-scrape-episode-ids.R
script2-download-episode-html-files.R
script3-extract-transcript-lines.R
script4-assemble-output-table.R
data/
episode-ids.txt
simpsons-transcripts.txt
html_files/
episode-21861.html
episode-21862.html
...
episode-73358.html
transcript_files/
season-01-episode-01.txt
season-01-episode-02.txt
...
season-33-episode-22.txt
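
The transcript file names in the listing above follow a `season-XX-episode-YY.txt` pattern. The sketch below shows one way to recover the season and episode numbers from such names in base R; the function `parse_transcript_name` is a hypothetical helper, not one of the repository's scripts.

```r
# Hypothetical helper: extract season and episode numbers from file names
# shaped like "season-01-episode-01.txt". Non-matching names yield NA.
parse_transcript_name <- function(filenames) {
  base <- basename(filenames)
  pattern <- "^season-([0-9]+)-episode-([0-9]+)\\.txt$"
  matched <- grepl(pattern, base)

  season <- rep(NA_integer_, length(base))
  episode <- rep(NA_integer_, length(base))
  # Convert only the matching names, so no coercion warnings are raised.
  season[matched] <- as.integer(sub(pattern, "\\1", base[matched]))
  episode[matched] <- as.integer(sub(pattern, "\\2", base[matched]))

  data.frame(file = base, season = season, episode = episode,
             stringsAsFactors = FALSE)
}

parsed <- parse_transcript_name(
  c("season-01-episode-01.txt", "season-33-episode-22.txt", "README.md")
)
print(parsed)
```
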
As a Data Science and Statistics educator, I love to share the work I do. Each month I spend dozens of hours curating learning materials like this resource. If you find it valuable or useful, please consider making a one-time donation via PayPal in any amount (e.g., what you would spend on buying me a cup of coffee or another drink). Your support really matters.