qtalr/book

Update the MASC dataset to that used in Recipe 7

francojc opened this issue · 2 comments

The transformed dataset from Recipe 7 is cleaner.

To remove non-words:

pos

  • CD, FW, LS, SYM

lemma

  • ^\W$
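As a rough illustration of the two filters listed above, here is a minimal sketch in plain Python. The row layout (dicts with `lemma` and `pos` keys), the helper names `is_word` and `remove_non_words`, and the toy rows are assumptions made for the example, not the book's actual code or the MASC data.

```python
import re

# POS tags listed in this issue as marking non-words
NON_WORD_POS = {"CD", "FW", "LS", "SYM"}

# Lemma pattern from this issue: exactly one non-word character
NON_WORD_LEMMA = re.compile(r"^\W$")


def is_word(row):
    """Return True if the row passes both filters (hypothetical helper)."""
    if row["pos"] in NON_WORD_POS:
        return False
    if NON_WORD_LEMMA.match(row["lemma"]):
        return False
    return True


def remove_non_words(rows):
    """Keep only the rows that look like words under the two filters."""
    return [row for row in rows if is_word(row)]


# Toy rows for illustration only (not taken from the MASC dataset)
sample = [
    {"lemma": "the", "pos": "DT"},
    {"lemma": ",", "pos": ","},
    {"lemma": "7", "pos": "CD"},
    {"lemma": "be", "pos": "VBZ"},
    {"lemma": ".", "pos": "."},
    {"lemma": "``", "pos": "``"},
]

words = remove_non_words(sample)
kept_lemmas = [row["lemma"] for row in words]
```

Note that `^\W$` as written only matches single-character lemmas, so a two-character token such as the opening-quote lemma in the toy rows is kept. If those should go too, a pattern like `^\W+$` might be closer to the intent, though that variant is an assumption and not something stated in the issue.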

Our first pass at calculating lemma frequency in @exm-eda-masc-count should bring something else to our attention. As we can see, among the most frequent lemmas are non-words such as `,` and `.`. As you can imagine, given the conventions of written and transcriptional language, these types are very frequent. For a frequency analysis focusing on words, however, we should probably remove them. Thinking ahead, there may also be other non-words that we want to remove, such as symbols, numbers, *etc*. Let's take a look at @fig-eda-masc-pos, where I've counted the part-of-speech tags `pos` in the dataset to see what other non-words we might want to remove.

Also, I don't think it is necessary to show or describe in detail the process of filtering the dataset. Just get to the analysis.

Addressed