unbalancedparentheses/data_science_in_julia_for_hackers

Chapter 4 corrections

Opened this issue · 9 comments

In this chapter, we show how to create a naive Bayes model. Check whether the definition of naive Bayes is clear enough.

We also describe this as a classification problem where we use supervised learning. Add a definition for both terms.

For example,

As we mentioned, what we are facing here is a classification problem, which means ...

Review the definition of "imbalanced dataset"

What does good training data look like? In this context, it means having a large enough corpus of emails that have been pre-classified as spam or ham, where the emails are collected from a sufficiently heterogeneous group of people (after all, spam is a somewhat subjective category based on what's irrelevant to you; this will differ between users), and where the proportion of spam vs. ham in our data is somewhat representative of the real proportion of emails we receive.

I propose using this opportunity to explain how pre-classified data enables the model to "learn" enough to do predictions, for example:

What does good training data look like? In this context, it means having a large enough corpus of emails that have already been tagged as spam or ham. It is important to have this training data pre-classified (a step frequently done by a human) so our model can "learn" to associate the target variable (the email type) with the input variables (the words contained in the email). For example, a successful model might infer from the training data that emails containing the word "discount" have a high probability of being spam.

I would modify this to also include a definition of "training" and "target" data/variables if it's not included in the book already
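To make the "learning from pre-classified data" idea concrete, a toy sketch could accompany those definitions. This one is in Python rather than the book's Julia, with made-up emails, just to show how tagged examples let us associate input words with the target label:

```python
from collections import Counter

# Toy pre-classified training set: the target variable is the label
# ("spam"/"ham"); the input variables are the words in each email.
training_set = [
    ("Huge discount, buy now", "spam"),
    ("Meeting rescheduled to Friday", "ham"),
    ("Discount on your next purchase", "spam"),
]

# Count how often each word appears in emails tagged as spam
spam_words = Counter(
    word.lower().strip(",")
    for text, label in training_set
    if label == "spam"
    for word in text.split()
)
print(spam_words["discount"])  # appears in both spam emails, so 2
```

The point for the reader: without the labels, there is nothing to associate the word counts with, which is exactly what makes this supervised learning.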

Imagine for a moment that our training set of emails consists just of these three emails

This example might be overly simplistic: since there are no ham emails in the dataset, the model would, as far as I know, just classify everything as spam.

Also, the definition of lemmatization gets in the way of the explanation; the paragraph needs better organization.

the columns of this matrix will be each mail now

Improve the wording of this code comment.
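It might also help to show the reader what "the columns of this matrix will be each mail" means with a tiny word-by-document count matrix. A sketch (Python, hypothetical vocabulary and emails, not the book's code):

```python
# Rows are vocabulary words, columns are emails: entry (i, j) counts
# how many times word i appears in email j.
emails = ["buy now buy", "meeting friday"]
vocabulary = ["buy", "now", "meeting", "friday"]

matrix = [
    [email.split().count(word) for email in emails]
    for word in vocabulary
]
print(matrix)  # [[2, 0], [1, 0], [0, 1], [0, 1]]
```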

The probability of a particular word, given that we have a spam email, can be calculated like so,

I propose:

The probability of a particular word being associated with spam or ham can be calculated like so,

$α$ is just a smoothing parameter, so that the probability of words that, for example, are not in the spam category don't give 0 probability.

I propose

$α$ is just a smoothing parameter, so that the probability of words that, for example, are not in the spam category doesn't go down to 0.

I would also look into this phrase "the probability of words" as it doesn't sound precise enough, at least to me.
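Whatever wording we settle on, showing the smoothed estimate itself would help. A minimal sketch (Python, hypothetical word counts; the book's exact formula may differ), assuming Laplace smoothing where α is added to each count and α·|V| to the denominator:

```python
alpha = 1  # smoothing parameter

# Hypothetical word counts gathered from the spam emails
word_counts_spam = {"discount": 2, "buy": 1}
vocabulary = {"discount", "buy", "meeting", "friday"}
total_spam_words = sum(word_counts_spam.values())

def p_word_given_spam(word):
    # Laplace smoothing: a word never seen in spam still gets a small
    # non-zero probability instead of collapsing the product to 0
    count = word_counts_spam.get(word, 0)
    return (count + alpha) / (total_spam_words + alpha * len(vocabulary))

print(p_word_given_spam("meeting"))  # (0 + 1) / (3 + 4), not 0
```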

We are now almost ready to make some predictions and test our model. The function below is just the implementation of the formula TAL that we have already talked about.

There's no prior discussion of this formula; fix this.
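Whatever the missing formula turns out to be, restating the prediction rule next to the code would help. A hedged sketch (Python, with toy counts invented here) of the usual naive Bayes decision rule, picking the class that maximizes log P(class) + Σ log P(word | class):

```python
import math

# Toy per-class word counts and priors (hypothetical numbers)
counts = {
    "spam": {"discount": 2, "buy": 1},
    "ham":  {"meeting": 1, "friday": 1},
}
priors = {"spam": 0.5, "ham": 0.5}
vocabulary = {"discount", "buy", "meeting", "friday"}
alpha = 1

def predict(words):
    # Pick the class maximizing log P(class) + sum of log P(word | class),
    # using the same Laplace-smoothed word probabilities as before
    def log_posterior(label):
        total = sum(counts[label].values())
        score = math.log(priors[label])
        for w in words:
            p = (counts[label].get(w, 0) + alpha) / (total + alpha * len(vocabulary))
            score += math.log(p)
        return score
    return max(priors, key=log_posterior)

print(predict(["discount", "buy"]))  # → "spam"
```

Working in log space avoids underflow when multiplying many small probabilities, which is probably worth a sentence in the chapter too.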

I think this chapter is the first time we mention the train/test split and the fit/predict/test workflow. We should provide a better explanation of why we work this way.
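As a starting point for that explanation, the split itself could be illustrated with something like this sketch (Python, with integers standing in for labeled emails; the 80/20 ratio is just an example):

```python
import random

emails = list(range(100))  # stand-ins for labeled emails
random.seed(42)            # fixed seed so the split is reproducible
random.shuffle(emails)

split = int(0.8 * len(emails))  # e.g. an 80/20 split
train, test = emails[:split], emails[split:]

# Fit on `train`, predict on `test`: accuracy on unseen data is what
# matters, since a model can trivially memorize its own training examples.
print(len(train), len(test))  # 80 20
```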