Do your own, better analysis of Google Books Ngram data. WIP

Custom Google Ngrams Search

The Google Ngrams Viewer is really neat. You can use it to compare how often certain terms appear in millions of books, year by year, over several centuries.

But what if you have an advanced use case, or run up against the limitations of the Google interface? For example, maybe you want to:

  • Search for bear as a noun (i.e. the animal) instead of bear as a verb (i.e. to endure).
  • Search for expressions where one word is unknown, like “capable of ___ing,” and fill in the blanks with all possible responses.
  • Do a regular expression search.
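The first two kinds of query can be sketched in a few lines of plain Haskell. This is only an illustration, not this project's implementation. It assumes the tab-separated row layout of the 2012 Google Books Ngram files (ngram, year, match count, volume count) and their part-of-speech suffixes such as `_NOUN`. The names `Row`, `Slot`, and `search` are hypothetical, and the sample rows are made up.

```haskell
module Main where

import Data.List (isSuffixOf)
import Data.Maybe (mapMaybe)

-- | One row of an ngram file. Assumed layout:
--   ngram <TAB> year <TAB> match_count <TAB> volume_count
data Row = Row
  { rowTokens :: [String]
  , rowYear   :: Int
  , rowCount  :: Int
  } deriving (Show, Eq)

-- | One position in a query pattern (hypothetical type).
data Slot
  = Exact String     -- the token must equal this string
  | EndsWith String  -- any token with this suffix, e.g. "ing"

-- | Split a string on tab characters.
splitTabs :: String -> [String]
splitTabs s = case break (== '\t') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitTabs rest

-- | Parse one line; malformed lines yield Nothing.
parseRow :: String -> Maybe Row
parseRow line = case splitTabs line of
  [ngram, year, count, _volumes] ->
    case (reads year, reads count) of
      ([(y, "")], [(c, "")]) -> Just (Row (words ngram) y c)
      _                      -> Nothing
  _ -> Nothing

matchSlot :: Slot -> String -> Bool
matchSlot (Exact w)    tok = tok == w
matchSlot (EndsWith s) tok = s `isSuffixOf` tok

-- | A row matches when it has one token per slot and every slot accepts its token.
matches :: [Slot] -> Row -> Bool
matches slots row =
  length slots == length (rowTokens row)
    && and (zipWith matchSlot slots (rowTokens row))

-- | Keep the well-formed rows that match the pattern.
search :: [Slot] -> [String] -> [Row]
search slots = filter (matches slots) . mapMaybe parseRow

-- Made-up sample rows, for illustration only.
sampleLines :: [String]
sampleLines =
  [ "capable of doing\t1990\t120\t80"
  , "capable of great\t1990\t300\t150"
  , "bear_NOUN\t1990\t45\t30"
  , "bear_VERB\t1990\t60\t41"
  ]

main :: IO ()
main = do
  -- "capable of ___ing": the third word is unknown but must end in "ing".
  mapM_ print (search [Exact "capable", Exact "of", EndsWith "ing"] sampleLines)
  -- "bear" as a noun only.
  mapM_ print (search [Exact "bear_NOUN"] sampleLines)
```

A full regular expression search would swap `matchSlot` for a regex match on the whole ngram string; that would need a regex library, so it is left out of this sketch.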

CLI Usage

You can run this yourself if you have plenty of free disk space on your machine.

  1. Make sure you have the Nix package manager installed. On Linux or macOS, you can usually do this with
$ sh <(curl -L https://nixos.org/nix/install)

Then run:

nix-shell
  2. Download the data from Google, using the download script:
cd src
runhaskell DownloadData

There are a lot of very big files. Not only are the gzipped files huge, but you will also need about 3GB of free space per gzip file for the database. That adds up to a ~140GB database file, just for unigrams and bigrams.
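As a back-of-the-envelope check before downloading, the two figures above can be turned into a small estimate. This is a rough sketch that assumes the 3GB-per-file figure holds evenly across files, which the real data may not; the function names are hypothetical.

```haskell
module Main where

-- Figure taken from the paragraph above: roughly 3 GB of database per gzip file.
gbPerGzipFile :: Double
gbPerGzipFile = 3

-- | Estimated database size in GB for a given number of downloaded gzip files.
estimateDbGB :: Int -> Double
estimateDbGB n = fromIntegral n * gbPerGzipFile

-- | Inverse: roughly how many gzip files a database of the given size corresponds to.
impliedFileCount :: Double -> Int
impliedFileCount dbGB = round (dbGB / gbPerGzipFile)

main :: IO ()
main = do
  putStrLn ("10 files need about " ++ show (estimateDbGB 10) ++ " GB")
  putStrLn ("140 GB corresponds to about " ++ show (impliedFileCount 140) ++ " files")
```

Taken at face value, a ~140GB database at 3GB per file corresponds to roughly 47 gzip files, so check your free space against the number of files you actually plan to download.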

  3. Generate the database:
runhaskell DB
  4. Or, run the regex-based script:
# From the project root directory
nix-shell --run "cabal run custom-ngrams-search -- -q searchTerm"