Dictionary text analysis tools for Coherence, Augmentation, Validation and Analysis

Primary language: R. License: MIT.


Ça va, CAVA? Dictionary Coherence, Augmentation, (Validation and Analysis)

CAVA is an R package that helps you work with dictionaries (keyword-based or lexical text analysis) in a valid way. It lets you use an embeddings model to expand (augment) a dictionary and to check its coherence. Validation and analysis are planned for a future release.

For a longer description, see our ICA Tool Demo abstract.

Installing and obtaining an embeddings model

You can install CAVA from GitHub:


Before starting, you need an embeddings model. Currently, only FastText .bin models are supported. The code below downloads and unpacks an English FastText model if it is not already present:

if (!file.exists("cc.en.300.bin")) {
  url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz"
  options(timeout = 4300)  # download timeout in seconds, assuming roughly 1 MB/s
  download.file(url, destfile = "cc.en.300.bin.gz")
  R.utils::gunzip("cc.en.300.bin.gz")  # install R.utils first if needed
}

Using CAVA

The main functions exposed by CAVA are shown below. For a more elaborate example, please see the example usage file.

Loading the FastText model, using the State of the Union speeches as the target corpus:

corpus = quanteda::corpus(sotu::sotu_text, docvars = sotu::sotu_meta)
vectors = load_fasttext("cc.en.300.bin", corpus)


Expanding a dictionary using wildcards and similarity:

dictionary = c("fin*", "eco*")
dictionary = expand_wildcards(dictionary, vectors)
candidates = similar_words(dictionary, vectors)
dictionary = c(dictionary, candidates$word[candidates$similarity>.4])
word         similarity
investment   0.5263631
investments  0.5070851
monetary     0.5067791
money        0.5034836
ultimately   0.4815412
profitable   0.4813354
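
To make the wildcard step more concrete, here is a minimal base R sketch of matching glob patterns against a vocabulary. It is an illustration only, not CAVA's implementation: the `expand_glob` helper and the toy `vocabulary` are hypothetical, and in practice the vocabulary would come from the loaded embeddings.

```r
# Hypothetical toy vocabulary (assumption; normally taken from the embeddings model)
vocabulary <- c("finance", "financial", "finish", "economy", "economic", "ecology", "money")

# Hypothetical helper, not part of CAVA: expand glob patterns against a vocabulary
expand_glob <- function(patterns, vocabulary) {
  # glob2rx() converts each wildcard pattern into an anchored regular expression
  regexes <- utils::glob2rx(patterns)
  # Collect all vocabulary words that match each pattern
  hits <- lapply(regexes, function(rx) grep(rx, vocabulary, value = TRUE))
  unique(unlist(hits))
}

expanded <- expand_glob(c("fin*", "eco*"), vocabulary)
print(expanded)
```

In this toy vocabulary, a broad pattern such as `fin*` also picks up `finish`, which is the kind of off-topic match that a coherence check is meant to surface.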

Expanding a dictionary using antonyms:

positive = c("good", "nice", "best", "happy")
negative = c("evil", "nasty", "worst", "bad", "unhappy")
candidates = similar_words(positive, vectors, antonyms = negative)
word       similarity
great      0.6968333
better     0.6371742
decent     0.6344477
excellent  0.6282948
wonderful  0.6057519
perfect    0.6032647


Computing and plotting pairwise similarities:

similarities = pairwise_similarities(dictionary, vectors)
similarities |> similarity_graph(max_edges=100) |> plot()
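
As a rough illustration of what pairwise similarity between word vectors means, the sketch below computes cosine similarities for a few made-up 3-dimensional vectors in base R. The vectors and variable names are assumptions for the example, and this is not necessarily how `pairwise_similarities` is implemented.

```r
# Toy "embeddings" with made-up numbers, one row per word (assumption for illustration)
emb <- rbind(
  money   = c(0.9, 0.1, 0.0),
  finance = c(0.8, 0.2, 0.1),
  finish  = c(0.0, 0.3, 0.9)
)

# Scale each row to unit length, so that a dot product equals cosine similarity
unit <- emb / sqrt(rowSums(emb^2))
sims <- unit %*% t(unit)  # symmetric word-by-word similarity matrix

# Long format: one row per unordered word pair
idx <- which(upper.tri(sims), arr.ind = TRUE)
pairs <- data.frame(
  word1 = rownames(sims)[idx[, "row"]],
  word2 = colnames(sims)[idx[, "col"]],
  similarity = sims[idx]
)
print(pairs[order(-pairs$similarity), ])
```

A graph like the one plotted above can be thought of as keeping only the strongest of these pairs as edges.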

Computing similarity to the dictionary centroid (sorted with the most distant words on top):

similarity_to_centroid(dictionary, vectors) |> head()
word      similarity
finished  0.1932090
finish    0.2300643
findings  0.2401340
finest    0.2455174
fine      0.2593122
finds     0.2703327
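
The idea behind a centroid-based check can be sketched in base R with made-up vectors: average the (normalised) word vectors, then rank words by cosine similarity to that average. The toy vectors and the choice to normalise before averaging are assumptions for this example; `similarity_to_centroid` may compute the centroid differently.

```r
# Toy "embeddings" with made-up numbers, one row per word (assumption for illustration)
emb <- rbind(
  money   = c(0.9, 0.1, 0.0),
  finance = c(0.8, 0.2, 0.1),
  finish  = c(0.0, 0.3, 0.9)
)

# Unit-length rows, then the centroid as their normalised mean
unit <- emb / sqrt(rowSums(emb^2))
centroid <- colMeans(unit)
centroid <- centroid / sqrt(sum(centroid^2))

# Cosine similarity of each word to the centroid, least similar first
sim <- drop(unit %*% centroid)
result <- data.frame(word = names(sim), similarity = unname(sim))
result <- result[order(result$similarity), ]
print(result)
```

Words at the top of such a ranking are the least typical of the dictionary as a whole and are natural candidates for manual review.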