AliMorty/Text-Classification

processing

un-lock-me opened this issue · 5 comments

Why did you treat each line of the text file as a document?

Thanks.

Hi! Actually it is because in my dataset every document was separated by '\n'.

Thanks for the reply. It would be nice if you could share a sample of your dataset.

I am having difficulty making sense of this part:
p_class_condition_on_not_w[i] = (count_of_that_class[i]-tmp[i])/(number_of_docs-word_occurance_frequency)
Would you mind explaining why you calculated the (1-p)log(1-p) term this way?
Why does the number of documents come into it?

What confuses me is that I have 20 classes with 1000 documents each. As I understand it, I should not need the number of documents, because the only thing that matters here is the frequency of words in each class versus the frequency of words across all classes.

Thanks.
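For anyone reading along, here is one possible reading of the quoted line, sketched under assumptions that are not confirmed in this thread: tmp[i] is taken to be the number of documents of class i that contain the word, and word_occurance_frequency the number of documents that contain it at all (a document frequency, not a raw token count). Under that reading the expression is an estimate of P(class i | word absent), which is why document counts appear in it. The helper names entropy and information_gain are hypothetical and only illustrate how such a term would feed an information-gain score.

```python
import math


def class_probs_given_word_absent(count_of_that_class, tmp,
                                  number_of_docs, word_occurance_frequency):
    """Estimate P(class i | word absent) from document counts.

    Assumption: tmp[i] = docs of class i containing the word,
    word_occurance_frequency = docs (any class) containing the word.
    """
    docs_without_word = number_of_docs - word_occurance_frequency
    if docs_without_word == 0:
        # The word appears in every document, so this term carries no weight.
        return [0.0] * len(count_of_that_class)
    return [(count_of_that_class[i] - tmp[i]) / docs_without_word
            for i in range(len(count_of_that_class))]


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(count_of_that_class, tmp):
    """Hypothetical information-gain score built on the estimate above."""
    number_of_docs = sum(count_of_that_class)
    word_occurance_frequency = sum(tmp)
    p_w = word_occurance_frequency / number_of_docs
    prior = [c / number_of_docs for c in count_of_that_class]
    if word_occurance_frequency:
        p_class_given_w = [t / word_occurance_frequency for t in tmp]
    else:
        p_class_given_w = []
    p_class_given_not_w = class_probs_given_word_absent(
        count_of_that_class, tmp, number_of_docs, word_occurance_frequency)
    return (entropy(prior)
            - p_w * entropy(p_class_given_w)
            - (1 - p_w) * entropy(p_class_given_not_w))


# Toy data: 2 classes with 4 documents each; the word occurs only in class 0.
print(class_probs_given_word_absent([4, 4], [4, 0], 8, 4))
print(information_gain([4, 4], [4, 0]))
```

In the toy data, every document that lacks the word belongs to class 1, so the absence of the word is as informative as its presence. That information only shows up when the counts are normalized by the number of documents without the word, which is what the denominator number_of_docs - word_occurance_frequency does under this reading.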