/word-and-character-frequencies

A collection of English word and character frequency analyses, along with links to corpora.

Word and Character Frequency Counts

MDickens Personal Data

An informal analysis, but one of my favorites. It contains character frequency analyses both with and without code in the input corpus. It also has some word frequency AND symbol frequency data! Symbol frequency can be surprisingly hard to find.

[Image: example screenshot from the site]

https://mdickens.me/typing/letter_frequency.html
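For a rough idea of what a character or symbol frequency count involves, here is a minimal sketch. It is not the method used on the site above. The function name `char_frequencies` and the `symbols_only` flag are made up for illustration, and treating Python's `string.punctuation` as "symbols" is an assumption.

```python
from collections import Counter
import string

def char_frequencies(text, symbols_only=False):
    """Return (character, share) pairs sorted from most to least frequent."""
    # Ignore whitespace; fold letters to lowercase so 'E' and 'e' count together.
    chars = [c.lower() for c in text if not c.isspace()]
    if symbols_only:
        # Assumption: "symbol" means ASCII punctuation only.
        chars = [c for c in chars if c in string.punctuation]
    counts = Counter(chars)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(c, n / total) for c, n in counts.most_common()]

# Code-like input shifts the symbol distribution a lot compared to prose.
sample = "if (x == y) { return x; }"
for ch, share in char_frequencies(sample, symbols_only=True)[:5]:
    print(f"{ch!r}: {share:.1%}")
```

Running it once on prose and once on prose plus source code is a quick way to see why the two corpora give different results.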

Lydell Bigram Frequencies

Bigram frequencies, along with the code used to generate them. Very useful.

https://gist.github.com/lydell/c439049abac2c9226e53
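If you want to count bigrams on your own text, a minimal sketch could look like the following. This is not the code from the gist. `bigram_counts` is a hypothetical helper, and it assumes that only the letters a-z matter and that a bigram never spans a space or punctuation mark.

```python
from collections import Counter
import re

def bigram_counts(text):
    """Count adjacent letter pairs inside words (pairs never span a gap)."""
    counts = Counter()
    # Assumption: only runs of a-z count as words; everything else is a separator.
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts

counts = bigram_counts("the then there")
print(counts.most_common(3))
```

Dividing each count by `sum(counts.values())` turns the raw counts into relative frequencies.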

Google Books Ngram Project

Many downloadable files containing info on how often different n-grams occur in the Google Books corpus.

An incredible amount of data, but presented very unintuitively. Largely unhelpful on its own, imo.

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
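The raw files are easier to work with once the per-year rows are collapsed into one total per n-gram. The sketch below assumes the version 2 layout described on that page, where each row is `ngram TAB year TAB match_count TAB volume_count`. `total_counts` and `min_year` are hypothetical names, and the sample rows are made up, not real counts.

```python
from collections import Counter

def total_counts(lines, min_year=0):
    """Sum match_count over all years (from min_year on) for each n-gram."""
    totals = Counter()
    for line in lines:
        # Assumed row layout: ngram, year, match_count, volume_count.
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        if int(year) >= min_year:
            totals[ngram] += int(match_count)
    return totals

# Made-up rows for illustration only.
sample = [
    "the\t1999\t100\t10\n",
    "the\t2000\t120\t12\n",
    "of\t2000\t80\t12\n",
]
print(total_counts(sample).most_common())
```

The downloads are gzip-compressed, so for a real file you would pass an open handle from `gzip.open(path, "rt", encoding="utf-8")` instead of the sample list.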

Peter Norvig Data Analysis

Peter Norvig took the Google Ngram data and made it useful. It's basically a distillation of all the important stuff you would want to know, presented in a much more helpful format. Much better.

[Image: example screenshot from the site]

http://norvig.com/mayzner.html

UCREL Data

Very interesting site with several frequency lists for both written and spoken English, broken down in several ways which are not commonly found in other data.

The simplest list, and the one most directly applicable to keyboard typing, is the Word Frequency In Written English list.

https://ucrel.lancs.ac.uk/bncfreq/flists.html

Vivian Cook Data

What can we say about Vivian Cook? Idk who Vivian Cook is, or why this data was put together. The website is terrible. But the data is incredible.

[Image: example screenshot from the site]

Punctuation frequency

Frequency of letter position in word (this is very hard to find elsewhere!)

Bigram Contact Chart (how often two letters touch each other)

BONUS: I took the contact chart data and made a bigram heatmap out of it.
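To make the last two kinds of data in the list above concrete, here is a minimal sketch of how letter-position counts and a contact chart could be computed from plain text. It is not how the site's data was produced. `position_counts` and `contact_counts` are hypothetical helpers, and treating "contact" as order-insensitive (so "on" and "no" count as the same pair) is an assumption.

```python
from collections import Counter, defaultdict
import re

def words_of(text):
    # Assumption: only runs of a-z count as words.
    return re.findall(r"[a-z]+", text.lower())

def position_counts(text):
    """Map each letter to a Counter of the 1-based positions it occupies in words."""
    positions = defaultdict(Counter)
    for word in words_of(text):
        for i, letter in enumerate(word, start=1):
            positions[letter][i] += 1
    return positions

def contact_counts(text):
    """Count how often two letters sit next to each other, in either order."""
    contacts = Counter()
    for word in words_of(text):
        for a, b in zip(word, word[1:]):
            # Sort the pair so 'on' and 'no' land on the same key.
            contacts["".join(sorted(a + b))] += 1
    return contacts

print(dict(position_counts("the tree")["e"]))
print(contact_counts("on no"))
```

A heatmap like the one mentioned above is then a matter of laying the contact counts out on a 26 by 26 grid.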

Corpora Links

The English Corpora website lets you browse and download many different corpora, such as the Wikipedia Corpus or the Corpus of American Soap Operas.

They don't have everything, and some cost money. So here are some other options: