Introduction to Computational Literary Analysis, Summer 2019

Instructor: Jonathan Reeve
Room: Dwinelle Hall, Room 229
Office: D-Lab Collaborative Space, Barrows Hall 356
Office Hours: Fridays, 12pm-2pm, or by appointment
Email address: jonathan.reeve@columbia.edu
Course website: https://github.com/JonathanReeve/course-computational-literary-analysis
Course description via UC-Berkeley
Course chatroom: https://gitter.im/course-computational-literary-analysis/2019
Readings: https://course-computational-literary-analysis-2019.netlify.com/

Description

This course is an introduction to computational literary analysis, one which presumes no background in programming or computer science. We will cover many of the topics of an introductory course in natural language processing or computational linguistics, but center our inquiries around literary critical questions. We will attempt to answer questions such as:

Did Shakespeare write the so-called Shakespeare apocrypha plays?
What are the characteristic speech patterns of the narrators in Wilkie Collins's The Moonstone?
What words are most frequently used to describe Katherine Mansfield's female characters?
Which novels of the nineteenth century are the most similar to each other? Which are the most different?

The course will teach techniques of text analysis using the Python programming language. Special topics to be covered include authorship detection (stylometry), topic modeling, and word embeddings. Literary works to be read and analyzed will be Wilkie Collins's The Moonstone, Katherine Mansfield's The Garden Party and Other Stories, and James Joyce's Dubliners.

Objectives

Although this course is focused on the analysis of literature, and British literature in particular, the skills you will learn may be used to computationally analyze any text. These are skills transferable to other areas of the digital humanities, as well as computational linguistics, computational social science, and the computer science field of natural language processing. There are also potential applications across the humanistic disciplines—history, philosophy, art history, and cinema studies, to name a few. Furthermore, text- and data-analysis skills are widely desired in today's world. Companies like Google and Facebook, for instance, need ways to teach computers to understand their users' search queries, documents, and sometimes books. The techniques taught in this course help computers and humans to understand language, culture, and human interactions. This deepens our understanding of literature, of our fellow humans, and the world around us.

Prerequisites

This course presumes no prior knowledge of programming, computer science, or quantitative disciplines. Those with programming experience, however, won't find this boring: the level of specialization is such that only the first week covers the basics.

Resources

The best resource for this course is the course GitHub repository. That repo will always contain the most up-to-date copy of this course syllabus, which is subject to change. We will also have a Gitter chatroom for any questions you might have along the way, especially those that other students might be able to answer. Check Gitter as often as you can, and ask your questions there first. You'll probably have to sign up for Gitter with a GitHub username, so create one if you don't already have one. Unless you're already well established on GitHub, please use your real name as your GitHub/Gitter username. (Mine is JonathanReeve, for example.)

If you want a second opinion about a question, or have questions that we can't answer in the chatroom, a good website for getting help with programming is Stack Overflow. The Internet is also full of Python learning resources. One of my favorites is Codecademy, which has a game-like interactive interface, badges, and more. If you like a good puzzle, and like being challenged, there's also the older Python Challenge.

Resources related to text analysis include, but are by no means limited to:

Requirements

Coursework falls into three categories:

Daily Annotations (30% of final grade)
Weekly Homeworks (40% of final grade)
Final project (30% of final grade)

Additionally, there are three course readings: one novel and two short story collections. Reading these closely will help you to contextualize the quantitative analyses, and will prepare you for the close reading tasks of the final paper.

Readings

All readings will be provided in digital form on the course GitHub repository, but if you prefer to read on paper, or to supplement your reading with background information and critical articles, I highly recommend the Broadview and Norton Critical Editions:

Wilkie Collins, The Moonstone, Broadview Edition
- Available as paperback, pdf, or epub at Broadview Press
Katherine Mansfield, The Garden Party and Other Stories, in Katherine Mansfield's Selected Stories, Norton Critical Edition
- Available as Katherine Mansfield's Selected Stories, in paperback from Norton Critical Editions
James Joyce, Dubliners, Norton Critical Edition
- Available as paperback from Norton Critical Editions

Annotations

For each reading assignment, please write at least two annotations to our editions of the text, using hypothes.is. Links are provided below. You'll have to sign up for a hypothes.is account first. As above, please use your real name as your username, so I know who you are. You may write about anything you want, but it will help your final project to think about ways in which computational analysis might help you to better understand what you observe in the text. Good annotations are:

Concise (think: a long tweet)
Well-written (although not too formal)
Observant (rather than evaluative)

You may respond to another student's annotation for one of your two, if you want.

Homework

Four short homework assignments, of 3-15 questions each, will be assigned weekly, and are due on Monday the following week. Jupyter notebook templates for each will be provided. Since we'll review the homework answers at the beginning of each week, late work cannot be accepted. There will be no homework due on the Monday of the last week, to give you more time to work on your final projects.

Submit homework to me at my email address above.

Final Project / Paper

The final project will be a literary argument, presented in the form of a short academic paper, developed by applying one or more of the text analysis techniques we have learned to a text or corpus of your choosing. Should you choose to work with a text or corpus other than the ones we've discussed in class, please clear it with me beforehand. Your paper should be a single Jupyter notebook, including prose in Markdown, code in Python, in-text citations, and a bibliography. A template will be provided. The length should be about the equivalent of a 6- to 8-page printed paper. You're allowed a maximum of three figures, so produce plots selectively. A word count function will be provided in the Jupyter notebook template.
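
To give a rough idea of how the prose in a notebook can be counted, here is a minimal sketch. It is not the function from the course template. It assumes the standard notebook JSON layout, in which each cell has a "cell_type" and a "source", and the name count_markdown_words and the tiny in-memory notebook are made up for illustration.

```python
import json
import re

def count_markdown_words(notebook_json):
    """Count words in the Markdown cells of a notebook given as a JSON string."""
    notebook = json.loads(notebook_json)
    total = 0
    for cell in notebook.get("cells", []):
        # Skip code cells: only prose counts toward the paper length.
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        # Crude definition of a word: a run of letters, digits, or apostrophes.
        total += len(re.findall(r"[A-Za-z0-9']+", source))
    return total

# A tiny hypothetical notebook, built in memory for illustration.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown",
         "source": ["# My Paper\n", "Collins's narrators differ."]},
        {"cell_type": "code", "source": ["print('not counted')"]},
    ]
})
print(count_markdown_words(example))
```

To count a real notebook, you could read its file contents into a string and pass that string to the function.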

During the final week of class, we'll have final project presentations. Your paper isn't required to be complete by then, but you'll be expected to speak about your project for about 5-7 minutes. Consider it a conference presentation.

Final papers will be evaluated according to the:

Quality of the literary critical argument presented
Quality of the close readings of the text or corpus
Quality of the Python text analysis
Literary interpretation of the results
Integration of the computational analysis with the literary argument

As with homework, please email me your final projects. You may optionally submit your final project to the course repository on GitHub, making it public, for a 5% bonus.

Attendance

Attendance is crucial. Although most course materials will be published in the course GitHub repository, they cannot replace hands-on experience with the techniques this course teaches. This is doubly true of in-class discussions of our readings. If you can't make it to a class for some reason, please let me know in advance, and arrange to get notes from a classmate.

Schedule

Note: this schedule is subject to some change, so please check the course website for the most up-to-date version.

Week 1: Introduction to Python for Text Analysis

Text: Wilkie Collins, The Moonstone
Tools: Python (Anaconda)

Unit 1.1 (7/8): Course intro. Motivation: what is possible with computational literary analysis?
Unit 1.2 (7/9): Installing Python. Python 2 v. 3. Jupyter. Strings.
- Text: The Moonstone, First Period, Through Chapter IX
Unit 1.3 (7/10): Working with strings, lists, and dictionaries.
- Text: First Period, Through Chapter XV
Unit 1.4 (7/11): Python basics, continued. Homework 1 assigned.
- Text: Reread part of The Moonstone, paying special attention to themes and motifs.
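
As a taste of how strings, lists, and dictionaries fit together, here is a minimal sketch that counts words in a sentence. The sentence is invented for illustration and is not a quotation from the novel.

```python
# A string: a sequence of characters.
sentence = "the diamond and the shivering sand"

# Splitting the string on whitespace gives a list of words.
words = sentence.split()

# A dictionary maps each word to the number of times it occurs.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

print(words)
print(counts)
```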

Week 2: Basic Text Analysis

Text: The Moonstone, Continued
Tools: Natural Language ToolKit (NLTK)

Unit 2.1: Review of Week 1 and Homework 1. Loading and manipulating plain text files.
- Text: First Period, Complete.
- Homework 1 due
Unit 2.2: Working with words. Tokenization techniques. Lemmatizers.
- Text: Second Period, First Narrative
Unit 2.3: Basic text statistics with the NLTK. Type / token ratios. Loops, functions, and other control structures.
- Text: Second Period, Second Narrative
Unit 2.4: More text statistics. Concordances, collocations, dispersion plots.
- Text: Second Period, Third Narrative
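
The type / token ratio mentioned in Unit 2.3 can be sketched in plain Python, without the NLTK. This is a minimal illustration, assuming a crude regular-expression tokenizer rather than any of the tokenizers covered in class. The sample sentence is invented.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Divide the number of distinct words (types) by the number of words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "The moon rose and the tide rose with the moon"
print(type_token_ratio(sample))
```

A higher ratio suggests a more varied vocabulary, but the ratio tends to fall as a text gets longer, so it is best compared across samples of similar length.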

Week 3: Word Frequency Analyses

Text: The Moonstone and Katherine Mansfield, The Garden Party and Other Stories
Tools: Scikit-Learn, Pandas

Unit 3.1: Review of Week 2 and Homework 2. Numpy, Pandas, and narrative time.
- Homework 2 due
- Text: Second Period, Fourth and Fifth Narratives
Unit 3.2: N-grams and part-of-speech analyses.
- Text: The Moonstone, Complete
Unit 3.3: WordNet and WordNet-based text analysis.
- Texts: "The Garden Party"
Unit 3.4: Downloading, using, and iterating over corpora.
- Texts: "The Daughters of the Late Colonel,"
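
The n-grams of Unit 3.2 are just runs of n consecutive tokens. Here is a minimal plain-Python sketch, assuming whitespace tokenization. The helper name ngrams and the sample sentence are chosen for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "she said that she said nothing".split()

# Count how often each bigram (n = 2) occurs.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```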

Week 4: Linguistic Techniques I

Text: Katherine Mansfield, The Garden Party and Other Stories
Tools: NLTK, SpaCy

Unit 4.1: Review of Week 3 and Homework 3. Corpus vectorization with Scikit-Learn. TF-IDF. Stylometry.
- Homework 3 due
- Texts: "The Young Girl,"
Unit 4.2: Comparative stylometry. Corpus-DB.
- Texts: "Marriage à la Mode,"
Unit 4.3: Stylometry, continued.
- Texts: "Her First Ball,"
Unit 4.4: Topic modeling with LDA. Quote parsing.
- Texts: "An Ideal Family,"
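
To show the idea behind the TF-IDF of Unit 4.1, here is a minimal plain-Python sketch. It assumes the textbook weighting, term frequency times log(N / document frequency), and whitespace tokenization. Scikit-Learn's vectorizer applies smoothing and normalization by default, so its numbers will differ. The three "documents" are just story titles standing in for full texts.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the garden party", "the young girl", "the ideal family"]
scores = tf_idf(docs)
print(scores[0])
```

A word that appears in every document, like "the" here, gets a weight of zero, while words distinctive to one document get higher weights.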

Week 5: Linguistic Techniques II

Text: James Joyce, Dubliners
Tools: SpaCy

Unit 5.1: Review of Week 4 and Homework 4. Using SpaCy. Named entity recognition.
- Homework 4 due
- Texts: "The Sisters," "An Encounter"
Unit 5.2: Intro to final project. Sentiment analysis. Macro-etymological analysis.
- Texts: "Araby," "Eveline"
Unit 5.3: Sentence structure analysis using SpaCy.
- Texts: "The Boarding House,"
Unit 5.4: Extras: TEI XML, APIs
- Texts: "Clay"

Week 6: Advanced Topics

Text: James Joyce, Dubliners
Tools: Scikit-Learn, SpaCy

Unit 6.1: Review of Week 5. Writing tips.
Unit 6.2: Extras: Social Network Analysis Example
Unit 6.3: Final project presentations.
Unit 6.4: Final project presentations continued. Wrap-up.
- Final project due.
