Datafable/epu-index

How to compare words


There are 3 cases where we need to compare words:

  1. Scoring an article by applying a weight to every word in the text.
  2. Counting the number of unique words and determining their term frequency to build a word cloud.
  3. Removing stop words from a text before determining the word frequencies.

How exactly do we compare words? I would propose:

  • Case insensitive
  • Treat the characters - and & as part of a word (e.g. in names of political parties such as CD&V).
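
A minimal sketch of what this proposal could look like in Python. The pattern `WORD_RE` and the helpers `tokenize` and `same_word` are hypothetical names for illustration, not code from this repository. The sketch assumes that - and & only count as part of a word when they sit between word characters.

```python
import re

# Assumed pattern: one or more word characters, optionally followed by
# further groups joined with '-' or '&' (e.g. "CD&V", "Open-VLD").
WORD_RE = re.compile(r"\w+(?:[-&]\w+)*")


def tokenize(text):
    """Split text into lowercase words, keeping inner '-' and '&'."""
    return [word.lower() for word in WORD_RE.findall(text)]


def same_word(a, b):
    """Case-insensitive comparison of two words."""
    return a.lower() == b.lower()


print(tokenize("CD&V en Open-VLD stemmen."))
print(same_word("Griekenland", "griekenland"))
```

With this pattern a trailing full stop is dropped from the last word of a sentence, while a leading or trailing - or & is not attached to a word.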

Drawbacks:

  • Including & will match political parties such as CD&V, but I see no obvious way to match SP.A: including a dot as a word character would also attach the full stop to the last word of each sentence.
  • Frequency counts will consider Grieks and Griekse to be two different words, as inflected forms are not merged.
  • Possible difficulties with special characters in names.
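
The first two drawbacks can be reproduced with a small sketch. Both patterns below are assumptions made for illustration (`WORD_RE` follows the proposal, `DOTTED_RE` additionally allows a dot), and the sample sentence is invented.

```python
import re
from collections import Counter

# Assumed pattern for the proposal: word characters joined by '-' or '&'.
WORD_RE = re.compile(r"\w+(?:[-&]\w+)*")
# Hypothetical variant that also treats '.' as a word character.
DOTTED_RE = re.compile(r"[\w&.-]+")

text = "De Griekse premier sprak met SP.A. Grieks nieuws volgt."

tokens = [w.lower() for w in WORD_RE.findall(text)]
dotted = [w.lower() for w in DOTTED_RE.findall(text)]

counts = Counter(tokens)

# Without the dot, "SP.A" falls apart into "sp" and "a".
print(tokens)
# With the dot, the sentence-final full stop sticks to "volgt." and "sp.a.".
print(dotted)
# "grieks" and "griekse" are counted separately.
print(counts["grieks"], counts["griekse"])
```

Merging forms such as Grieks and Griekse would need an extra step such as stemming, which is outside what is proposed here.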

Text will be cleaned first to remove punctuation (see #55). All words are then set to lowercase and compared.
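
As a rough sketch of how clean-then-lowercase could serve the three cases listed at the top of this issue. All function names here are hypothetical, and the cleaning rule is an assumption: the actual behaviour is defined in #55, which is not reproduced here. The sketch assumes that - and & survive cleaning, as proposed above.

```python
import re
from collections import Counter


def clean(text):
    # Assumption: every character that is not a word character,
    # whitespace, '-' or '&' is replaced by a space.
    return re.sub(r"[^\w\s&-]", " ", text)


def words(text):
    """Clean the text, lowercase it and split it on whitespace."""
    return clean(text).lower().split()


def remove_stop_words(tokens, stop_words):
    """Drop stop words, comparing case-insensitively."""
    stop = {s.lower() for s in stop_words}
    return [t for t in tokens if t not in stop]


def term_frequencies(tokens):
    """Count how often each word occurs (input for a word cloud)."""
    return Counter(tokens)


def score(tokens, weights):
    """Sum the weight of every word; words without a weight count as 0."""
    lowered = {word.lower(): w for word, w in weights.items()}
    return sum(lowered.get(t, 0.0) for t in tokens)


tokens = words("De economie: onzeker, zegt CD&V.")
print(remove_stop_words(tokens, ["De"]))
print(term_frequencies(tokens)["cd&v"])
print(score(tokens, {"Economie": 1.5, "onzeker": 2.0}))
```

Because the stop words and the weight keys are lowercased as well, all three cases compare words in the same way. Note that a free-standing - or & would remain as a token of its own under this cleaning rule.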