This repository holds the raw text of over 900 U.S. presidential speeches, and some open source code for doing (so far rudimentary) analysis of them.
Almost all the texts come from The Miller Center at the University of Virginia. The Miller Center provides not just the text of the speeches, but audio and video when available. Only the text is collected here: since each speech is encapsulated in a single web page, and those pages are in a pretty consistent format (thank you, Miller Center!), the raw text of each speech was easy to extract with Emacs macros. See the section "Notes on data formatting and consistency" below for details.
So far I've added one speech from a source other than the Miller Center: this speech at a September 2019 foreign policy press conference comes directly from a White House web page, because for some reason the Miller Center didn't have it yet.
Work in progress; start by running do-all.
Each transcript in the data directory starts with a single line of the form "President: Name Of Some President", followed on the next line (or maybe after some blank lines) by the transcript.
The presidential identifiers appear to be consistent, e.g., LBJ is always "Lyndon B. Johnson", never "Lyndon Johnson". Here's how I checked:
$ grep -h "^President: " data/* | sort | uniq
President: Abraham Lincoln
President: Andrew Jackson
President: Andrew Johnson
President: Barack Obama
President: Benjamin Harrison
President: Bill Clinton
President: Calvin Coolidge
President: Chester A. Arthur
President: Donald Trump
President: Dwight D. Eisenhower
President: Franklin D. Roosevelt
President: Franklin Pierce
President: George H. W. Bush
President: George Washington
President: George W. Bush
President: Gerald Ford
President: Grover Cleveland
President: Harry S. Truman
President: Herbert Hoover
President: James A. Garfield
President: James Buchanan
President: James K. Polk
President: James Madison
President: James Monroe
President: Jimmy Carter
President: John Adams
President: John F. Kennedy
President: John Quincy Adams
President: John Tyler
President: Lyndon B. Johnson
President: Martin Van Buren
President: Millard Fillmore
President: Richard Nixon
President: Ronald Reagan
President: Rutherford B. Hayes
President: Theodore Roosevelt
President: Thomas Jefferson
President: Ulysses S. Grant
President: Warren G. Harding
President: William Harrison
President: William McKinley
President: William Taft
President: Woodrow Wilson
President: Zachary Taylor
$
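For analysis code, the header convention described above can be read programmatically. The following is a minimal Python sketch, assuming only the "President: Name" first-line format stated here; the name `split_transcript` is hypothetical and not part of this repository.

```python
HEADER_PREFIX = "President: "


def split_transcript(text):
    """Split raw transcript text into (president_name, body).

    Assumes the first line has the form "President: Name Of Some President",
    optionally followed by blank lines before the transcript itself.
    """
    first_line, _, rest = text.partition("\n")
    if not first_line.startswith(HEADER_PREFIX):
        raise ValueError("first line is not a 'President: ' header")
    name = first_line[len(HEADER_PREFIX):].strip()
    # Drop any blank lines between the header and the transcript.
    body = rest.lstrip()
    return name, body
```

Applied to the contents of each file in the data directory, the first element of the result could be tallied to group speeches by president.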
Each data file's name can be algorithmically transformed back to its original URL:
-
Convert the file's "YYYY-MM-DD" prefix to "monthname-D-YYYY". For example, "2008-11-04" would become "november-4-2008".
-
Strip off the .txt from the end of the filename.
-
Prepend https://millercenter.org/the-presidency/presidential-speeches/ to the result.
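The three steps above can be sketched in Python as follows. This is an illustration under the stated filename convention, not code from this repository; `filename_to_url` is a hypothetical name, and the sketch does not check that the resulting page actually exists.

```python
import re
from pathlib import Path

BASE_URL = "https://millercenter.org/the-presidency/presidential-speeches/"
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def filename_to_url(filename):
    """Map a data file name (with or without a directory) to its original URL."""
    stem = Path(filename).name
    # Strip off the .txt from the end of the filename.
    if stem.endswith(".txt"):
        stem = stem[: -len(".txt")]
    # Split the "YYYY-MM-DD" prefix from the rest of the name.
    match = re.match(r"(\d{4})-(\d{2})-(\d{2})(.*)$", stem)
    if match is None:
        raise ValueError(f"no YYYY-MM-DD prefix in {filename!r}")
    year, month, day, rest = match.groups()
    if not 1 <= int(month) <= 12:
        raise ValueError(f"bad month in {filename!r}")
    # Rebuild the prefix as "monthname-D-YYYY" (day without a leading zero).
    return f"{BASE_URL}{MONTHS[int(month) - 1]}-{int(day)}-{year}{rest}"
```

For example, `filename_to_url("data/1991-07-31-press-conference-mikhail-gorbachev.txt")` yields the base URL followed by `july-31-1991-press-conference-mikhail-gorbachev`.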
Below is a list of various consistency and formatting issues I have noticed in the data. Some of these are issues that would likely affect any comparative analysis, and would have to be filtered out or otherwise handled specially. A general solution would be a 'pcat' command ("presidential catenate", like Unix 'cat' but living in the White House I guess) that takes a transcript as input and prints just the President's actual words as output.
-
Double hyphen used for em-dash is inconsistently spaced.
In some speeches--particularly the older ones--the double hyphen without spaces on either side is used as an em-dash, as in this sentence. In other speeches -- particularly the newer ones -- it's done with spaces on either side, as in this sentence.
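One way to handle this before comparing texts is to rewrite every double hyphen to a single spacing convention. The sketch below is an assumption about how that could be done, not something this repository does; `normalize_dashes` is a hypothetical name, and the choice of the spaced form as the target is arbitrary.

```python
import re

# Exactly two hyphens (not part of a longer run), plus any spaces or tabs
# immediately around them.
_DOUBLE_HYPHEN = re.compile(r"[ \t]*(?<!-)--(?!-)[ \t]*")


def normalize_dashes(text, replacement=" -- "):
    """Give every double-hyphen dash the same surrounding spacing."""
    return _DOUBLE_HYPHEN.sub(replacement, text)
```

Runs of three or more hyphens are left alone, since they are more likely rules or separators than dashes. A double hyphen at the very start or end of a line picks up a stray space, which a later whitespace cleanup would need to remove.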
-
Debates and press conferences include other speakers.
Many of these transcripts are not speeches but rather debates or press conferences, in which other speakers' words are included. For debates, the moderator and other participants are included, and this can be a significant amount of the text. For press conferences where it's just the President answering questions from reporters, most of the words are the President's, with a few from reporters. But in a joint press conference with another politician, e.g., data/1991-07-31-press-conference-mikhail-gorbachev.txt, a lot of the words in the file come from someone other than the President.
$ ls data/*conference*
data/1890-04-19-statement-international-american-conference.txt
data/1921-11-12-opening-speech-conference-limitation-armament.txt
data/1943-12-24-fireside-chat-27-tehran-and-cairo-conferences.txt
data/1945-08-09-radio-report-american-people-potsdam-conference.txt
data/1964-02-01-press-conference.txt
data/1964-02-29-press-conference-state-department.txt
data/1964-03-07-press-conference-white-house.txt
data/1964-04-16-press-conference-state-department.txt
data/1964-05-06-press-conference-south-lawn.txt
data/1964-07-24-press-conference-state-department.txt
data/1965-02-04-press-conference.txt
data/1965-03-13-press-conference-white-house.txt
data/1965-03-20-press-conference-lbj-ranch.txt
data/1965-04-27-press-conference-east-room.txt
data/1965-06-01-press-conference-east-room.txt
data/1965-07-13-press-conference-east-room.txt
data/1965-07-28-press-conference.txt
data/1965-08-25-press-conference-white-house.txt
data/1966-07-05-press-conference-lbj-ranch.txt
data/1966-07-20-press-conference-east-room.txt
data/1966-10-06-press-conference.txt
data/1966-12-31-press-conference.txt
data/1967-02-02-press-conference.txt
data/1967-03-09-press-conference.txt
data/1967-08-18-press-conference.txt
data/1967-11-17-press-conference.txt
data/1968-04-03-press-conference.txt
data/1974-02-25-presidents-news-conference.txt
data/1977-03-09-remarks-president-carters-press-conference.txt
data/1981-01-29-first-press-conference.txt
data/1991-07-31-press-conference-mikhail-gorbachev.txt
data/1993-01-29-press-conference-gays-military.txt
data/2009-01-12-final-press-conference.txt
data/2010-02-09-news-conference-congressional-gridlock.txt
data/2010-11-03-press-conference-after-2010-midterm-elections.txt
$ ls data/*debate*
data/1960-09-26-debate-richard-nixon-chicago.txt
data/1960-10-07-debate-richard-nixon-washington-dc.txt
data/1960-10-13-debate-richard-nixon-new-york-and-los-angeles.txt
data/1960-10-21-debate-richard-nixon-new-york.txt
data/1976-09-23-debate-president-gerald-ford-domestic-issues.txt
data/1976-10-06-debate-president-gerald-ford-foreign-and.txt
data/1976-10-22-debate-president-gerald-ford.txt
data/1980-10-28-debate-ronald-reagan.txt
data/1984-10-07-debate-walter-mondale-domestic-issues.txt
data/1984-10-21-debate-walter-mondale-defense-and-foreign.txt
data/1988-09-25-debate-michael-dukakis.txt
data/1992-10-11-debate-bill-clinton-and-ross-perot.txt
data/1996-10-06-presidential-debate-senator-bob-dole.txt
-
One speech was missing the presidential identifier element.
This speech, given by Barack Obama on 8 Sep 2011, was missing a standardized HTML element that every other speech uses to identify the president who gave the speech. I added the element...
<p class="president-name">Barack Obama</p>
...right before the line indicating the date of the speech, which was present where expected:
<p class="episode-date">September 08, 2011</p>
I reported this on 2017-05-14 via https://millercenter.org/contact.
-
19 speeches are missing the HTML element that indicates location.
I haven't added this element, since none of the analysis I'm doing takes location into account, but FWIW 19 speeches (about 2%) were missing the '<span class="speech-loc">...</span>' element:
$ for name in *; \
    do if grep -q '<span class="speech-loc">' ${name}; \
      then echo -n ""; else echo "${name}"; \
    fi; \
  done
1842-08-11-message-senate-negotiations-britain.txt
1965-01-20-inaugural-address.txt
1981-04-28-address-program-economic-recovery.txt
1981-11-18-speech-strategic-arms-reduction-talks.txt
1982-06-09-address-bundestag-west-germany.txt
1982-06-17-speech-united-nations-general-assembly.txt
1982-09-20-address-nation-lebanon.txt
1983-03-23-address-nation-national-security.txt
1983-11-02-speech-creation-martin-luther-king-jr-national.txt
1983-11-04-remarks-us-casualties-lebanon-and-grenada.txt
1984-06-06-40th-anniversary-d-day.txt
1984-10-07-debate-walter-mondale-domestic-issues.txt
1985-05-05-bergen-belsen-concentration-camp.txt
1986-09-14-speech-nation-campaign-against-drug-abuse.txt
1988-08-15-farewell-address-republican-national-convention.txt
1988-12-16-speech-foreign-policy.txt
2004-07-17-remarks-national-security-and-war-effort.txt
2011-05-19-speech-american-diplomacy-middle-east-and-north.txt
2011-10-21-remarks-end-war-iraq.txt
$
This element is normally present in the Miller Center web page for each speech, and looks something like this:
<span class="speech-loc">The White House</span>
In most or all of the speeches where it's missing, the location is known. E.g., we clearly know where Ronald Reagan's 1982 speech at the Bundestag took place.
This issue is also mentioned in my 2017-05-14 report to https://millercenter.org/contact.
Note that the raw texts in the data directory do not include the HTML anymore, so the command above will not reproduce this list when run against the current data.
-
The transcript of at least one speech is really a summary.
The transcript of this speech is really a summary (see original at https://millercenter.org/the-presidency/presidential-speeches/february-24-1841-argument-supreme-court-case-united-states-v).
How many more such are there? I don't know yet.
-
Unexpected number and spelling in one speech.
This speech has the number "56" inline in the text, and spells a word "punishmt". Are there other old speeches with odd spelling?
How many other transcripts have issues like that? I don't know yet.
-
Some speeches were actually written out and signed.
They say "By the president" or some such at the bottom, and often mention another official who facilitated the transmission. See, e.g., this one. Others are signed "Very respectfully," or some such, e.g., this one.
At the very least, such footer text should be stripped for the purposes of textual analysis, because it's not part of what the President said.
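A rough sketch of such footer stripping is below. It assumes the footer begins on a line starting with one of the two phrases quoted above and sits within the last few lines of the file; both the marker list and the `max_tail_lines` cutoff are guesses that would need checking against the real transcripts, and `strip_signature_footer` is a hypothetical name.

```python
import re

# Only the two sign-off phrases mentioned above; real data may need more.
FOOTER_MARKERS = re.compile(
    r"^\s*(by the president|very respectfully)\b", re.IGNORECASE
)


def strip_signature_footer(text, max_tail_lines=10):
    """Cut the text at the first footer marker found near the end, if any."""
    lines = text.splitlines()
    # Look only at the tail, so the same phrase in the body is not
    # mistaken for a signature.
    start = max(0, len(lines) - max_tail_lines)
    for i in range(start, len(lines)):
        if FOOTER_MARKERS.match(lines[i]):
            return "\n".join(lines[:i]).rstrip() + "\n"
    return text
```

Text with no marker in its tail is returned unchanged, so the function is safe to apply to every transcript.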