Rediscovering Text as Data

Instructors: Christopher Hench & Claudia von Vacano
Course Location: 310 Hearst Mining
Course Time: Monday 4-6 PM
Instructor's Office: Barrows 350
Instructor's Office Hours: Thursday 10-11 AM
Instructors' Email: chench@berkeley.edu, cvacano@berkeley.edu

Course Description: Humanists have traditionally emphasized the ‘close reading’ of a text, where value is placed on the nuances of specific passages. The increasing amount of digital text being published and archived affords us an opportunity to read text differently: as data on a scale larger than ever before. This ‘distant reading’ approach (mediated through the computer) complements our ‘close reading’ by providing a broader, previously inaccessible context for interpretation. It also allows us to quantify and model language, such as words in novels or syllables in poetry, to uncover hidden patterns in a single text or body of texts. In this course, we will help you find and explore newly available texts of interest to you and guide your understanding of textual phenomena obtained through computational methods, enriching your reading of an individual text.

As a connector course to Data 8 (Foundations of Data Science), this class will give students experience in the Python programming language. Students must be concurrently enrolled in the main course (Data 8) or have already completed it.

Computing:

Course GitHub Repository
Datahub Interact Link
Programming Language: Python 3. We recommend datahub.berkeley.edu; alternatively, use a local Anaconda installation (https://www.continuum.io/downloads). This course assumes you are learning Python fundamentals in Data 8.

Goals & Format: Rediscovering Text as Data is a disciplinary connector to the main Foundations of Data Science (Data 8) course. The main course provides a baseline of programming, statistical concepts, and data visualization. In this connector, we will practice each of these and apply them toward problems in the study of literature and the humanities more broadly. While much of what we discuss will concern literary texts, many of the methods and techniques we employ are applicable to textual data across domains and in industry. Students are encouraged to work with whichever corpus interests them, as long as critical conclusions relevant to the course are drawn.

We will dedicate two days to close reading, one at the beginning and one at the end of the course. This will ground our study in traditional aesthetic concerns and force us to return to tradition with our new perspective. The other weeks will address various computational methods employed in the Digital Humanities. We will work consciously to understand what we read as literature by looking at a wide variety of literary texts and evaluating arguments that critics have made about them.

Schedule: Below is an outline of the course. All readings will be available to download on bCourses or in the GitHub repository. The schedule is subject to change.

Week	Topic	Readings	Assignment
1 - 8/28	Introduction
2 - 9/11	Operationalizing	(1) Sophocles, Antigone; (2) Moretti, "Operationalizing"	Notebook Exercises
3 - 9/18	Close Reading I & Strings	(1) Kafka, "The Judgement"; (2) Berman, "Tradition and Betrayal in 'Das Urteil'"; (3) Kafka's short biography on Wikipedia	Notebook Exercises
4 - 9/25	Stylometry	(1) Caedmon's Hymn; (2) Thornbury, "The Poet Alone"	Close Reading Paper - Assigned; Notebook Exercises
5 - 10/02	Intro to NLTK & SpaCy		Notebook Exercises
6 - 10/09	Entity Extraction & Network Analysis		Close Reading Paper - Due; Notebook Exercises
7 - 10/16	Textual Similarity & Clustering		Notebook Exercises
8 - 10/23	Classification		Final Project: Consultation; Notebook Exercises
9 - 10/30	Topic Modeling		Final Project: Proposal; Notebook Exercises
10 - 11/06	Metadata		Final Project: Push preliminary code; Notebook Exercises
11 - 11/13	Word Embeddings		Final Project: Consultation; Notebook Exercises
12 - 11/20	Close Reading II
13 - 11/27	Project Elevator Pitches

Readings & Presentation: All readings must be completed before class.

Once during the semester, each student will be required to make a brief presentation on the week's critical reading that will initiate our discussion. This presentation should offer a summary of the article, including any context that may help us to understand its concerns, and describe some of the problems it explores. The presentation should begin by raising a few questions that will spur our discussion before presenting what has interested you. Students must submit the summary that will guide their presentation, written according to the template.

Sign-up link here!

Participation: Please prepare to speak at least once during each class discussion. Your voice is valuable and your perspective unique. We will be completing Google Form surveys throughout the course for purposes of data collection and keeping tabs on participation.

Close Reading Paper: The first paper assigned will be a traditional literature-class paper. You will make an interpretive argument based on a close reading of a text. This paper will be written on one of the literary texts we have read in the first four weeks of class, and I will offer an optional prompt for each. The paper should be 2 pages (double-spaced) in length.

Final Project: The course is built around the final project (which replaces the final exam). This consists of a 4-5 page (double-spaced) paper in which an argument is made about one or more texts using evidence from both inferential statistics and close reading. This paper must examine an interpretive problem and may be written on any text or texts you choose, literary or other. While the corpus does not have to be literary in nature, please incorporate into your analysis the critical foundation we build in class.

In preparation for the final paper, students will be required to complete several milestones. During Week 8, students will meet with an instructor outside of class to consult on texts, interpretive problems, and statistical methods of interest. In Week 9, students will submit a one-paragraph proposal (about 250 words) for their final project that addresses these three elements. We will meet again during Week 11 to discuss progress and obstacles in the project, as well as any findings. In Week 12, students will submit one page describing their methods and statistical findings, including one visualization.

In keeping with the best practices of the field, students will be required to make their data set (copyright permitting) and code available through GitHub. Preliminary code will be posted during Week 10, and final code, capable of reproducing your findings, before our last class. Please send me the link to your materials before this class so I can create an image and we can all run your code together!

During our final class, students will deliver a 3-5 minute elevator pitch describing the challenge being explored and any decisions made or roadblocks faced while applying statistical methods in literature. This will act as a kind of rough draft for the paper, as well as offer an opportunity for feedback from your peers. The final draft of the paper is due on December 11.

Grading Rubric:

	Category	Weight
Participation:	Discussion	20%
	Readings Presentation	10%
Projects:	Close Reading Paper	15%
	Project Milestones	25%
	Final Paper	25%
Assignments:	Notebook Exercises	5%

This course was adapted from @teddyroland's connector course. Thank you!
