Chapter 4 corrections
Opened this issue · 9 comments
In this chapter, we show how to create a naive Bayes model. Check if the definition of naive Bayes is clear enough.
We also describe this as a classification problem where we use supervised learning. Add a definition for both terms.
For example,
As we mentioned, what we are facing here is a classification problem, which means ...
Review the definition of "imbalanced dataset"
What does good training data look like? In this context, it means having a large enough corpus of emails that have been pre-classified as spam or ham, where the emails are collected from a sufficiently heterogeneous group of people (after all, spam is a somewhat subjective category based on what's irrelevant to you -- this will differ between users), and where the proportion of spam vs. ham in our data is roughly representative of the real proportion of mail we receive.
I propose using this opportunity to explain how pre-classified data enables the model to "learn" enough to do predictions, for example:
What does good training data look like? In this context, it means having a large enough corpus of emails that have been pre-classified as spam or ham, i.e., each email is already tagged with its type. It is important to have this training data pre-classified (a step frequently done by a human) so our model can "learn" to associate the target variable (email type) with the input variables (words contained in the email). For example, a successful model might infer from the training data that emails containing the word "discount" have a high probability of being spam.
I would modify this to also include definitions of "training" and "target" data/variables, if they're not already included elsewhere in the book.
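To make the "learning from pre-classified data" idea concrete, here is a minimal sketch (the emails and labels are illustrative, not from the book) of how a model can associate words with labels simply by counting:

```python
from collections import Counter

# Hypothetical pre-classified training set: (email text, label) pairs.
training_set = [
    ("huge discount on watches", "spam"),
    ("discount meds buy now", "spam"),
    ("meeting notes attached", "ham"),
]

# Count how often each word appears under each label.
word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in training_set:
    word_counts[label].update(text.split())

# "discount" appears only in spam emails, so the model can learn
# to associate that word with the spam label.
print(word_counts["spam"]["discount"])  # 2
print(word_counts["ham"]["discount"])   # 0
```

These per-label counts are exactly what a naive Bayes model turns into conditional probabilities later in the chapter.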
Imagine for a moment that our training set of emails consists just of these three emails
This example might be overly simplistic: since there are no ham emails in the dataset, the model would, as far as I can tell, classify everything as spam.
Also, the definition of lemmatization gets in the way of the explanation; the paragraph needs better organization.
the columns of this matrix will be each mail now
Improve the wording of this code comment, e.g. "each column of this matrix now represents one email".
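For context, here is a small sketch (with made-up emails) of the kind of document-term matrix that comment describes, where each column is one email and each row is a vocabulary word:

```python
# Illustrative emails; in the book these come from the training corpus.
emails = ["buy cheap meds", "meeting at noon", "cheap meds now"]
vocabulary = sorted({w for e in emails for w in e.split()})

# matrix[i][j] = number of times vocabulary[i] appears in emails[j],
# so each COLUMN of the matrix corresponds to one email.
matrix = [[e.split().count(word) for e in emails] for word in vocabulary]

for word, row in zip(vocabulary, matrix):
    print(f"{word:>8}: {row}")
```

Seeing the row/column orientation spelled out like this may make the corrected comment easier to phrase.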
The probability of a particular word, given that we have a spam email, can be calculated like so,
I propose:
The probability of a particular word being associated with spam or ham can be calculated like so,
$α$ is just a smoothing parameter, so that the probability of words that, for example, are not in the spam category don't give 0 probability.
I propose
$α$ is just a smoothing parameter, so that the probability of words that, for example, are not in the spam category doesn't go down to 0.
I would also look into this phrase "the probability of words" as it doesn't sound precise enough, at least to me.
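It might also help readers to see the smoothing in action. The exact formula in the book may differ, but Laplace/Lidstone smoothing typically looks like $(count(w, spam) + \alpha) / (N_{spam} + \alpha |V|)$; a sketch with made-up counts:

```python
def word_prob(word, counts, total_words, vocab_size, alpha=1.0):
    """Smoothed P(word | class): never zero, even for unseen words."""
    return (counts.get(word, 0) + alpha) / (total_words + alpha * vocab_size)

# Toy spam counts (illustrative): 10 words total in the spam class,
# vocabulary of 5 distinct words overall.
spam_counts = {"discount": 3, "buy": 4, "now": 3}
p_seen = word_prob("discount", spam_counts, 10, 5)
p_unseen = word_prob("meeting", spam_counts, 10, 5)
print(p_seen)    # (3 + 1) / (10 + 5) = 0.2666...
print(p_unseen)  # (0 + 1) / (10 + 5) = 0.0666..., not 0
```

The last line shows exactly what α buys us: "meeting" never appears in spam, yet its conditional probability stays above zero.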
We are now almost ready to make some predictions and test our model. The function below is just the implementation of the formula TAL that we have already talked about.
There's no prior discussion of this formula, fix this.
I think this chapter is the first time we mention the train/test split and the fit-predict-test workflow. We should better explain why we work this way.