Google Books n-gram frequency lists

This repository provides cleaned lists of the most frequent words and n-grams (sequences of n words) in the Google Books Ngram Corpus (v3/20200217, all languages), including some English translations, together with customizable Python code that reproduces these lists.

Lists with n-grams

Lists with the most frequent n-grams are provided separately by language and n. Available languages are Chinese (simplified), English, English Fiction, French, German, Hebrew, Italian, Russian, and Spanish; n ranges from 1 to 5. In the provided lists the language subcorpora are restricted to books published in the years 2010-2019, but in the Python code both the year range and the number of most frequent n-grams included can be adjusted.

The lists are found in the ngrams directory. For all languages except Hebrew, cleaned lists are provided for the

  • 10,000 most frequent 1-grams,
  • 5,000 most frequent 2-grams,
  • 3,000 most frequent 3-grams,
  • 1,000 most frequent 4-grams,
  • 1,000 most frequent 5-grams.

For Hebrew, due to the small corpus size, only the 200 most frequent 4-grams and 80 most frequent 5-grams are provided.

All cleaned lists also contain the number of times each n-gram occurs in the corpus (its frequency, column freq). For 1-grams (words) there are two additional columns:

  1. cumshare, which for each word gives the cumulative share of all words in the corpus made up by that word and all more frequent words.
  2. en, which contains the English translation of the word, obtained using the Google Cloud Translate API (only for non-English languages).

Here are the first rows of 1grams_french.csv:

ngram  freq        cumshare  en
de     1380202965  0.048     of
la     823756863   0.077     the
et     651571349   0.100     and
le     614855518   0.121     the
à      577644624   0.142     at
l'     527188618   0.160     the
les    503689143   0.178     them
en     390657918   0.191     in
des    384774428   0.205     of the
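
For instance, such a list can be loaded and inspected with pandas. This is only a minimal sketch, assuming the repository root as working directory and the column layout shown above:

```python
# Minimal sketch: load a cleaned 1-gram list (path assumed relative to the
# repository root) and inspect the most frequent words and their coverage.
import pandas as pd

df = pd.read_csv("ngrams/1grams_french.csv")

top1000 = df.head(1000)
print(top1000[["ngram", "freq", "en"]].head())
print(f"Share of the corpus covered by the top 1000 words: {top1000['cumshare'].iloc[-1]:.1%}")
```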

The lists found directly in the ngrams directory have been cleaned and are intended for use when developing language-learning materials. The sub-directory ngrams/more contains uncleaned and less-cleaned versions which might be of use to, for example, linguists:

  • the most frequent raw n-grams as Google stores them (suffixed 0_raw),
  • only keeping entries without part-of-speech (POS) tags (suffixed 1a_no_pos),
  • only keeping entries with POS tags (only for 1-grams, suffixed 1b_with_pos),
  • entries excluded from the final cleaned lists (suffixed 2_removed).

Learning languages with this

To provide some motivation for why learning the most frequent words first may be a good idea when learning a language, the following graph is provided.

graph_1grams_cumshare_rank_*.svg

For each language, it plots the frequency rank of each 1-gram (i.e. word) on the x-axis and the cumshare on the y-axis. So, for example, after learning the 1000 most frequent French words, one can understand more than 70% of all words, counted with duplicates, occurring in a typical book published between 2010 and 2019 in version 20200217 of the French Google Books Ngram Corpus.
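
The same question can be asked in reverse, i.e. how many words are needed to reach a given level of coverage. A small sketch that reads this off the cumshare column (the file path, as above, is an assumption):

```python
# Sketch: smallest number of top-ranked words needed to reach a target coverage,
# read off the cumshare column of a cleaned 1-gram list (path is an assumption).
import pandas as pd

df = pd.read_csv("ngrams/1grams_french.csv")

for target in (0.5, 0.7, 0.8):
    reached = df["cumshare"] >= target
    if reached.any():
        rank = int(reached.idxmax()) + 1  # rows are sorted by frequency, so index + 1 = rank
        print(f"{target:.0%} coverage is reached after roughly the top {rank} words")
    else:
        print(f"{target:.0%} coverage is not reached within this list")
```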

For n-grams other than 1-grams the returns to learning the most frequent ones are not as steep, as there are so many possible combinations of words. Still, people tend to learn better when learning things in context, so one use of them could be to find common example phrases for each 1-gram. Another approach is the following: Say one wants to learn the 1000 most common words in some language. Then one could, for example, create a minimal list of the most common 4-grams which include these 1000 words and learn it.
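
One simple way to build such a list is a greedy pass over the 4-grams in frequency order, keeping each 4-gram that still contains uncovered target words. This is only an illustrative sketch, not part of the repository's code, and the file paths are assumptions:

```python
# Greedy sketch: collect frequent 4-grams until (ideally) every target word is
# contained in at least one of them. File paths are assumptions; note that the
# provided 4-gram lists are short, so full coverage will generally not be reached.
import pandas as pd

targets = set(pd.read_csv("ngrams/1grams_french.csv")["ngram"].head(1000))
fourgrams = pd.read_csv("ngrams/4grams_french.csv")["ngram"]

uncovered = set(targets)
selected = []
for phrase in fourgrams:  # the lists are sorted by frequency, most frequent first
    new_words = uncovered.intersection(phrase.split())
    if new_words:
        selected.append(phrase)
        uncovered -= new_words

print(f"{len(selected)} 4-grams cover {len(targets) - len(uncovered)} of {len(targets)} target words")
```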

Although the n-gram lists have been cleaned with language learning in mind and contain some English translations, they are not intended to be used directly for learning, but rather as intermediate resources for developing language-learning materials. The provided translations only give the English word closest to the most common meaning of the word. Moreover, depending on one's language-learning goals, the Google Books Ngram Corpus might not be the best corpus on which to base learning materials – see the next section.

The underlying corpus

This repository is based on the Google Books Ngram Corpus Version 3 (with version identifier 20200217), made available by Google as n-gram lists here. This is also the data that underlies the Google Books Ngram Viewer. The corpus is a subset, selected by Google based on the quality of optical character recognition and metadata, of the books digitized by Google and contains around 6% of all books ever published (1, 2, 3).

When assessing the quality of a corpus, both its size and its representativeness of the kind of material one is interested in are important.

Size

The Google Books Ngram Corpus Version 3 is huge, as this table of the count of words in it by language and subcorpus illustrates:

Language          # words, all years    # words, 2010-2019
Chinese           10,778,094,737        257,989,144
English           1,997,515,570,677     283,795,232,871
American English  1,167,153,993,435     103,514,367,264
British English   336,950,312,247       45,271,592,771
English Fiction   158,981,617,587       73,746,188,539
French            328,796,168,553       35,216,041,238
German            286,463,423,242       57,039,530,618
Hebrew            7,263,771,123         76,953,586
Italian           120,410,089,963       15,473,063,630
Russian           89,415,200,246        19,648,780,340
Spanish           158,874,356,254       17,573,531,785

Note that these numbers are for the total number of words, not the number of unique words. They also include many entries which aren't valid words, but due to the size of the corpus the cleaning steps are performed only on the most common words, so the number of words that would remain in the whole corpus after cleaning is unavailable. Moreover, Google only publishes words that appear over 40 times, while these counts also include words appearing less often than that.

We see that even after restricting the language subcorpora to books published in the last 10 available years, the number of words is still larger than 15 billion for each subcorpus, except Chinese and Hebrew. This is substantially larger than for all other corpora in common use, which never seem to contain more than a few billion words and often are much smaller than that (4).

Representativeness

When it comes to representativeness, it depends on the intended use. The Google Books Ngram Corpus contains only books (no periodicals, spoken language, websites, etc.). Each book edition is included at most once. Most of these books come from a small number of large university libraries, "over 40" in Version 1 of the corpus, while a smaller share is obtained directly from publishers (1). So, for example, if one intends to use this corpus for learning a language in which one mainly will read books of the kind large university libraries are interested in, then the words in this corpus are likely to be quite representative of the population of words one might encounter in the future.

Python code

The code producing everything is in the python directory. Each .py-file is a script that can be run from the command line as python python/filename.py, where the working directory must be the directory containing the python directory. Every .py-file has a settings section at the top, and additional cleaning settings can be specified using the files in python/extra_settings. The default settings have been chosen to make the code run reasonably fast and to keep the resulting repository size reasonably small.

Reproducing everything

Optionally, start by running create_source_data_lists.py from the repository root directory to recreate the source-data folder with lists of links to the Google source data files.
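
For illustration, these link lists can be thought of as enumerating the .gz shards per language and n. A sketch of the idea, where the base URL, language code, and shard count are assumptions inferred from file names such as 1-00006-of-00024.gz, not taken from the repository's code:

```python
# Sketch: construct candidate URLs for the .gz shards of one language and one n.
# The base URL, language code, and number of shards are assumptions.
BASE = "https://storage.googleapis.com/books/ngrams/books/20200217"

def gz_urls(lang: str, n: int, num_files: int) -> list[str]:
    """URLs of all shards, following the pattern {n}-{index}-of-{num_files}.gz."""
    return [f"{BASE}/{lang}/{n}-{i:05d}-of-{num_files:05d}.gz" for i in range(num_files)]

print(gz_urls("fre", 1, 24)[:2])
```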

Run download_and_extract_most_freq.py from the repository root directory to download each file listed in source-data (a ".gz-file") and extract the most frequent n-grams in it into a list saved in ngrams/more/{lang}/most_freq_ngrams_per_gz_file. To save disk space, each .gz-file is deleted immediately after this. Since the lists of most frequent n-grams per .gz-file still take up around 36 GB with the default settings, only one example list is uploaded to GitHub: ngrams_1-00006-of-00024.gz.csv. No cleaning has been performed at this stage, so this is what the raw data looks like.
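
Conceptually, processing one such .gz-file amounts to streaming its lines, summing the match counts over the selected years, and keeping the most frequent n-grams. A rough sketch, assuming the v3 line format "ngram<TAB>year,match_count,volume_count<TAB>..." and a URL for the example shard mentioned above; neither is taken from the repository's code:

```python
# Sketch: stream one raw .gz shard and total up match counts over 2010-2019.
# The URL and the assumed line format are illustrative assumptions.
import gzip
import urllib.request
from collections import Counter

URL = "https://storage.googleapis.com/books/ngrams/books/20200217/fre/1-00006-of-00024.gz"
YEARS = set(range(2010, 2020))

counts = Counter()
with urllib.request.urlopen(URL) as resp:
    with gzip.open(resp, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            ngram, *year_fields = line.rstrip("\n").split("\t")
            for field in year_fields:
                year, match_count, _volume_count = field.split(",")
                if int(year) in YEARS:
                    counts[ngram] += int(match_count)

for ngram, freq in counts.most_common(10):
    print(ngram, freq)
```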

Run gather_and_clean.py to gather all the n-grams into lists of the overall most frequent ones and clean these lists (see the next section for details).
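
The gathering step can be pictured as concatenating the per-file lists and summing frequencies per n-gram. A sketch, assuming each per-file list is a CSV with columns ngram and freq (the directory layout is an assumption based on the description above):

```python
# Sketch of the gathering step: combine per-.gz-file lists into one overall list.
# Directory layout and column names are assumptions.
import glob
import pandas as pd

files = glob.glob("ngrams/more/french/most_freq_ngrams_per_gz_file/ngrams_1-*.csv")
gathered = (
    pd.concat(pd.read_csv(f) for f in files)
      .groupby("ngram", as_index=False)["freq"].sum()
      .sort_values("freq", ascending=False)
      .head(10_000)
      .reset_index(drop=True)
)
```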

Run google_cloud_translate.py to add English translations to all non-English 1-grams using the Google Cloud Translate API (this requires an API key; see the file header). By default, only 1-grams are translated and only into English, but by changing the settings any n-gram can be translated into any language supported by Google. Google randomly capitalizes translations, so an attempt is made to correct for this. Moreover, a limited number of manual corrections are applied using manual_translations_1grams.csv.
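
A minimal sketch of how such translations can be obtained with the basic (v2) Google Cloud Translate client; credentials (e.g. via the GOOGLE_APPLICATION_CREDENTIALS environment variable) must be set up, and the file path is an assumption:

```python
# Sketch: translate the most frequent French 1-grams into English using the
# Google Cloud Translate basic (v2) client. Requires valid credentials.
import pandas as pd
from google.cloud import translate_v2 as translate

client = translate.Client()
words = pd.read_csv("ngrams/1grams_french.csv")["ngram"].head(10).tolist()

results = client.translate(words, source_language="fr", target_language="en")
for word, res in zip(words, results):
    # Lower-case the output to undo the random capitalization mentioned above.
    print(word, "->", res["translatedText"].lower())
```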

Finally, graph_1grams_cumshare_rank.py produces graph_1grams_cumshare_rank_light.svg and its dark version.
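
The graph is essentially the cumshare column plotted against frequency rank. A simplified sketch with matplotlib for a single language (the styling details of the actual script are omitted):

```python
# Simplified sketch of the cumshare-vs-rank graph for a single language.
# The actual script handles all languages and light/dark styling.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ngrams/1grams_french.csv")
plt.plot(range(1, len(df) + 1), df["cumshare"], label="French")
plt.xscale("log")
plt.xlabel("frequency rank of word")
plt.ylabel("cumulative share of all words in the corpus")
plt.legend()
plt.savefig("graph_1grams_cumshare_rank_sketch.svg")
```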

Cleaning steps performed

All the cleaning steps are performed in gather_and_clean.py, except the cleaning of the Google translations.

The following cleaning steps are performed programmatically; a rough sketch of a few of them is given after the list:

  1. part-of-speech (POS) tags are removed
  2. 1-grams, French and Italian: contractions are split
  3. trailing "_" is removed
  4. capitalized and uncapitalized n-grams are merged
  5. n-grams consisting of only punctuation and/or numbers are removed
  6. 1-grams, German: words containing only uppercase letters are removed unless in the manually created list of exceptions upcases_to_keep.csv
  7. 1-grams, except German: words starting with an uppercase letter are removed unless in the manually created lists of exceptions upcases_to_keep.csv
  8. 1-grams: one-character words are removed unless in the manually created lists of exceptions onechars_to_keep.csv
  9. German and Russian: contractions are removed
  10. English Fiction: n-grams starting with "'" are removed
  11. n-grams with non-word characters other than "'", " ", and "," are removed
    • additionally other than "-" for Russian and "\"" for Hebrew
  12. n-grams with numbers are removed
  13. n-grams in the wrong alphabet are removed.
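
To make a few of these rules concrete, here is a rough sketch of how steps 1, 3, 4, 5, and 12 could look for a raw 1-gram list. The regexes, the file path, and the lower-casing in step 4 are simplifications and assumptions, not the exact rules used in gather_and_clean.py:

```python
# Rough sketch of a few cleaning rules on a raw 1-gram list (path is an assumption).
import pandas as pd

df = pd.read_csv("ngrams/more/french/1grams_french_0_raw.csv")

# Step 1: drop entries carrying a part-of-speech tag, e.g. "de_DET".
df = df[~df["ngram"].str.contains(r"_[A-Z.]+$")]

# Step 3: remove a trailing "_".
df["ngram"] = df["ngram"].str.rstrip("_")

# Step 4: merge capitalized and uncapitalized variants (here simply lower-cased;
# the actual code decides which spelling to keep).
df["ngram"] = df["ngram"].str.lower()
df = df.groupby("ngram", as_index=False)["freq"].sum()

# Steps 5 and 12: drop n-grams made up of only punctuation/numbers or containing digits.
df = df[~df["ngram"].str.fullmatch(r"[\W\d_]+")]
df = df[~df["ngram"].str.contains(r"\d")]

df = df.sort_values("freq", ascending=False).reset_index(drop=True)
```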

Moreover, the following cleaning steps have been performed manually, using the English translations to help inform decisions:

  1. 1-grams, all languages: The lists of removed words have been checked for words wrongly removed.
  2. 1-grams, except Hebrew: The cleaned lists have been fully checked for extra words to exclude.
  3. 1-grams, Hebrew: The cleaned lists have been checked for extra words to exclude up to rank 100.
  4. When n-grams wrongly included or excluded were found during the manual cleaning steps above, this was corrected for by either adjusting the programmatic rules, or by adding them to one of the lists of exceptions, or by adding them to the final lists of extra n-grams to exclude.
  5. n-grams in the manually created lists of extra n-grams to exclude have been removed. These lists are in python/extra_settings and named extra_{n}grams_to_exclude.csv.

When manually deciding which words to include and exclude, the following rules were applied. Exclude: person names (some exceptions: Jesus, God), city names (some exceptions: if they differ a lot from the English name and are common enough), company names, abbreviations (some exceptions, e.g. ma, pa), word parts, and words in the wrong language (unless in common use). Do not exclude: country names, names for ethnic/national groups of people, geographical names (e.g. rivers, oceans), colloquial terms, and interjections.

Related work

For a carefully created list of all words in the English part of the Google Books Ngram Corpus (version 20200217), sorted by frequency, see the hackerb9/gwordlist repository.

An extensive list of frequency lists in many languages is available on Wiktionary.

Limitations and known problems

What remains to be done is completing the final manual cleaning for Hebrew and the cleaning of the lists of n-grams for n > 1. An issue that might be addressed in the future is that some of the latter contain "," as a word. It might also be desirable to include more common abbreviations. No separate n-gram lists are provided for the American English and British English corpora. Of course, some errors remain.

The content of this repository is licensed under the Creative Commons Attribution 3.0 Unported License. Issue reports and pull requests are most welcome!