The Google Ngrams Viewer is really neat. You can use it to compare how often certain terms appear in millions of books, year by year, across several centuries.
But what if you have an advanced use case, or run up against the limitations of the Google interface? For example, maybe you want to:
- Search for bear as a noun (i.e. the animal) instead of bear as a verb (i.e. to endure).
- Search for expressions where one word is unknown, like “capable of ___ing,” and fill in the blanks with all possible responses.
- Do a regular expression search.
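As a rough illustration, the first two searches above can be written as regular expressions over the ngram column of the raw data. This is only a sketch, not the project's implementation. It assumes the tab-separated layout of the 2012 Google Books Ngram files (ngram, year, match count, volume count) and part-of-speech tags written as suffixes such as `_NOUN` and `_VERB`. The `ngram_grep` helper and the sample rows are made up for the example.

```shell
# Hypothetical helper: print rows whose ngram column (field 1) matches an
# extended regular expression. Reads tab-separated rows on stdin.
# Note: awk -v interprets backslash escapes, so keep patterns backslash-free.
ngram_grep() {
  awk -F '\t' -v re="$1" '$1 ~ re'
}

# A tiny made-up sample in the assumed format:
# ngram <TAB> year <TAB> match_count <TAB> volume_count
sample=$(printf '%s\t%s\t%s\t%s\n' \
  'bear_NOUN' 1990 120 40 \
  'bear_VERB' 1990 300 95 \
  'capable of doing' 1990 50 20 \
  'capable of love' 1990 10 8)

# 1. "bear" as a noun only.
printf '%s\n' "$sample" | ngram_grep '^bear_NOUN$'

# 2. "capable of ___ing": any single lowercase word ending in -ing in the open slot.
printf '%s\n' "$sample" | ngram_grep '^capable of [a-z]+ing$'
```

The same helper could be fed a decompressed data file instead of the sample, for instance `zcat FILE.gz | ngram_grep '^bear_NOUN$'`, where `FILE.gz` is a placeholder for one of the downloaded files.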
You can run this project yourself if you have plenty of free disk space on your machine.
- Make sure you have the Nix package manager installed. On Linux or macOS, you can usually do this with:
$ sh <(curl -L https://nixos.org/nix/install)
Then run:
nix-shell
- Download the data from Google, using the download script:
cd src
runhaskell DownloadData
There are a lot of very big files. Not only are the gzipped files huge, but you will also need about 3GB of free space per gzip file for the database. That adds up to a ~140GB database file for unigrams and bigrams alone.
- Generate the database:
runhaskell DB
- Or, run the regex-based script:
# From the project root directory
nix-shell --run "cabal run custom-ngrams-search -- -q searchTerm"
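Since the database alone needs roughly 140GB, it may be worth checking free disk space before starting the download step. The following is a minimal sketch, assuming a POSIX `df`. `has_free_gb` is a hypothetical helper and not part of the project.

```shell
# Hypothetical helper: succeeds when at least $1 GB (approximate, counted in
# 1024-based units) are available on the filesystem that holds path $2.
has_free_gb() {
  # df -P prints one line per filesystem; column 4 is available space in KB.
  avail_kb=$(df -Pk "$2" | awk 'NR==2 {print $4}')
  [ $((avail_kb / 1024 / 1024)) -ge "$1" ]
}

# 140 is the database size estimate for unigrams and bigrams; the gzipped
# downloads need additional space on top of that.
if has_free_gb 140 .; then
  echo "Enough space for the unigram and bigram database."
else
  echo "Less than 140GB free; the database will not fit here."
fi
```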