By Kaspar Beelen and Luke Blaxill
These lectures are part of the Text Mining and Statistics course for Historians. For an overview of the full course content go here.
Run all on Binder
Lecture 0: How to access Notebooks on Binder
Lecture A: Introduction to the course
We start with a brief introduction to the aims and principles of this course: why should a historian bother to learn a programming language for analysing textual and other types of data? Why Python (notebooks) in particular? We also discuss what to expect from this course (and what not to expect) and give an overview of the skills you will obtain.
Lecture B: Basic Python, a gentle introduction
This notebook starts with a gentle introduction to the basic elements of the Python syntax. We discuss how to create and manipulate variables, and demonstrate common operations. Some topics are more extensively discussed in 'break out' notebooks or in external documentation.
Lecture C: Text and String Methods
Finally, we move on from more fundamental syntax to working with actual text data. In this notebook, we introduce 'string methods', which are Python tools for processing and manipulating text. We also demonstrate how to open and read text files (at scale).
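As a rough illustration of what string methods look like in practice, the sketch below applies a few built-in methods to a single sentence. The sentence is invented for this example and is not taken from the course materials.

```python
# Toy sentence, invented for illustration (not from the course data).
text = "  The Medical Officer reported 12 cases of cholera.  "

cleaned = text.strip()               # remove leading and trailing whitespace
lowered = cleaned.lower()            # normalise to lower case
words = lowered.split()              # split on whitespace into a list of strings
has_cholera = "cholera" in lowered   # substring test, returns True or False
replaced = cleaned.replace("cholera", "typhus")  # returns a new string

print(words)
print(has_cholera)
print(replaced)
```

Note that strings are immutable: each method returns a new string and leaves the original untouched.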
Lecture A: Processing Texts
This lesson introduces core Python objects such as lists and dictionaries that you will need when processing text files. We discuss the application of Natural Language Processing tools to historical documents. More precisely, we show how to use NLTK and spaCy to split a text into tokens and analyse the grammatical structure of a sentence with part-of-speech tagging.
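To give a sense of how lists and dictionaries fit together when processing text, here is a minimal sketch that tokenises a sentence and counts the tokens. It uses a simple regular expression as a stand-in tokenizer; this is an assumption made to keep the example self-contained, and it is much cruder than the tokenizers shipped with NLTK or spaCy. The sentence is invented.

```python
import re

# Toy sentence, invented for illustration.
sentence = "The board met on Monday, and the board adjourned on Tuesday."

# Stand-in tokenizer: lower-case the text and keep runs of word characters.
# The result is a list, which preserves the order of the tokens.
tokens = re.findall(r"\w+", sentence.lower())

# A dictionary maps each distinct token to the number of times it occurs.
counts = {}
for token in tokens:
    counts[token] = counts.get(token, 0) + 1

print(tokens)
print(counts)
```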
Lecture B: Corpus Selection
In this notebook, we introduce techniques for selecting relevant information from large data sets. We discuss how to filter and select documents based on their metadata as well as their textual content. The strategies covered here allow you to select documents that are relevant to your research question and build question-specific subcorpora.
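The idea of combining metadata filters with content filters can be sketched in a few lines of plain Python. The records, field names (`year`, `place`, `text`) and the helper `select` below are all hypothetical and only serve to illustrate the principle; the notebook itself may organise the data differently.

```python
# Hypothetical mini-corpus: each document is a dict of metadata plus text.
corpus = [
    {"year": 1868, "place": "Lambeth", "text": "An outbreak of cholera was reported."},
    {"year": 1875, "place": "Hackney", "text": "Drainage works were completed."},
    {"year": 1892, "place": "Lambeth", "text": "No cholera cases this year."},
]

def select(docs, start, end, keyword):
    """Keep documents dated within [start, end] whose text mentions keyword."""
    keyword = keyword.lower()
    return [
        doc for doc in docs
        if start <= doc["year"] <= end and keyword in doc["text"].lower()
    ]

# Build a question-specific subcorpus: cholera reports from 1860 to 1880.
subcorpus = select(corpus, 1860, 1880, "cholera")
print(subcorpus)
```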
Lecture C: Corpus Exploration
After building a subcorpus, you need tools to explore and analyse the texts meaningfully. We focus on a wide range of tools provided by the Natural Language Toolkit, such as concordance or Keyword in Context (KWIC), collocation analysis and feature selection. We use reports written by Victorian Medical Officers of Health as a case study.
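For readers unfamiliar with Keyword in Context, the following sketch shows the underlying idea in plain Python: for every occurrence of a keyword, we collect a fixed window of tokens to its left and right. The function `kwic` is a hypothetical helper written for this overview, not the NLTK implementation, and the example sentence is invented.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each match."""
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# Toy sentence, invented for illustration.
text = "the water supply was poor and the water was drawn from the river"
hits = kwic(text.split(), "water", window=2)

for left, word, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>10} | {word} | {right}")
```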
Lecture D: Trends over Time
The last notebook in the text mining series focuses on studying discursive trends over time. The goal of this notebook is to understand the changing content of British political manifestos.
Lecture A: Exploring DataFrames with Pandas (Part I)
This notebook introduces the Pandas library and explores tools for working programmatically with tabular data in Python. We have a closer look at realistic and complex metadata derived from the British Library catalogue and demonstrate how you can refine and reorganise information with the goal of studying trends over time.
Lecture B: Exploring DataFrames with Pandas (Part II)
This notebook uses "synthetic" demographic data about age and gender in late-Victorian London. We discuss different types of variables and strategies for visualising distributions. We proceed with summarising information using descriptive statistics, such as mean and median. From a historical point of view, we investigate whether men are generally younger than women in late-Victorian London.
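As a small illustration of descriptive statistics, the sketch below computes the mean and median age per group using only Python's standard library. The ages are made up for this example and are not the course's synthetic data set.

```python
import statistics

# Made-up ages for illustration only.
ages = {
    "men": [23, 31, 35, 40, 62],
    "women": [25, 33, 38, 41, 70],
}

# Summarise each group with its mean and median.
summary = {
    group: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }
    for group, values in ages.items()
}

for group, stats in summary.items():
    print(group, stats)
```

In both groups the mean lies above the median, because a single high value pulls the mean upwards while leaving the median unchanged.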
Lecture A: Distributions and Hypothesis Testing
In this section, we move from descriptive to inferential statistics. We assess the statistical 'significance' of the gendered differences observed in the previous notebook (on descriptive statistics). We pursue a data-driven and intuitive approach to significance testing. First, we "bootstrap" confidence intervals and then explore permutation for hypothesis testing.
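The two resampling ideas mentioned here can be sketched with the standard library alone. The functions `bootstrap_ci` and `permutation_p_value` are hypothetical helpers written for this overview, under the assumption that we compare group means; the percentile method used for the interval is one of several possible choices, and the notebook may proceed differently. The ages are made up.

```python
import random
import statistics

def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement, same size as the original sample.
    means = sorted(
        statistics.mean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Made-up ages for illustration only.
men = [23, 31, 35, 40, 62, 28, 33]
women = [25, 33, 38, 41, 70, 36, 44]

print(bootstrap_ci(men))
print(permutation_p_value(men, women))
```

The p-value is the share of label shufflings that produce a difference at least as large as the one observed, which is the intuition the permutation approach relies on.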
Lecture A: Correlation and Linear Regression
This session takes a closer look at modelling the relation between different variables. The first notebook discusses how to compute and interpret correlation coefficients and then continues with a gentle introduction to linear regression. The goal is to understand variation in lifespans in late-Victorian London: do residents of more affluent boroughs tend to live longer?
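To make the link between correlation and simple linear regression concrete, here is a minimal sketch that computes Pearson's r and a least-squares line by hand. The helpers `pearson_r` and `fit_line`, the "affluence index" and the lifespan values are all hypothetical, invented for this overview; in practice one would normally rely on a statistics library.

```python
import math

def _sums(x, y):
    """Return the sums of squares and cross-products around the means."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, sxy, sxx, syy

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    _, _, sxy, sxx, syy = _sums(x, y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    mean_x, mean_y, sxy, sxx, _ = _sums(x, y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented values: a borough affluence index and mean lifespan in years.
affluence = [1, 2, 3, 4, 5]
lifespan = [52, 55, 57, 60, 61]

r = pearson_r(affluence, lifespan)
slope, intercept = fit_line(affluence, lifespan)
print(round(r, 3), round(slope, 2), round(intercept, 2))
```

The slope reads as the expected change in lifespan for a one-unit increase in the affluence index, whereas r only expresses how closely the points follow a straight line.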
Lecture B:
The second notebook on linear regression turns to more advanced techniques: Generalised Linear Models (GLMs). We use GLMs to model and predict count outcomes. We explore two case studies in detail: a) gender bias in university applications and b) gender and participation in the British House of Commons.
Lecture A: Supervised Classification
Lecture B: Topic Modelling
Lecture C: Word Vectors