processing

Question

un-lock-me opened this issue 6 years ago · 5 comments

un-lock-me commented 6 years ago

Why did you treat each line of the text file as a document?

Thanks.

Answer 1 · 2018-11-14T10:23:14.000Z

Hi! Actually it is because in my dataset every document was separated by '\n'.

Answer 2 · 2018-11-14T16:30:59.000Z

Thanks for replying. It would be nice if you could share a sample of your dataset.

Answer 3 · 2018-11-15T16:32:01.000Z

I am having difficulty making sense of this part:
p_class_condition_on_not_w[i] = (count_of_that_class[i]-tmp[i])/(number_of_docs-word_occurance_frequency)
Would you mind explaining why you calculated the (1-p)log(1-p) term this way?
Why does the number of documents come into it?

What confuses me is that I have 20 classes with 1000 documents each, but as I understand it I should not need the number of documents, because the only thing that matters here is the frequency of a word in each class versus its frequency across all classes.

Thanks.

Answer 4 · 2018-11-16T15:34:28.000Z

You're welcome! Actually, my dataset is a little weird :D But it has a class label for each document. The structure of my dataset is something like this: Class@@@@its Context'\n'Class@@@@its Context'\n'
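The layout described above (one document per line, with the class label in front of a separator) could be read roughly as follows. This is only a sketch: the function name `parse_labeled_lines`, the sample lines, and the four-character separator are assumptions for illustration. The thread mentions both `@@@@` and a longer run of `@` characters, so the separator is left as a parameter.

```python
from collections import Counter

# Assumption: the separator is exactly "@@@@"; pass a different string if the
# real file uses a longer run of '@' characters.
SEPARATOR = "@@@@"


def parse_labeled_lines(lines, separator=SEPARATOR):
    """Split each non-empty line into a (class_label, text) pair."""
    documents = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines between documents
        label, sep, text = line.partition(separator)
        if not sep:
            raise ValueError(f"separator {separator!r} not found in line: {line!r}")
        documents.append((label.strip(), text.strip()))
    return documents


# Hypothetical sample data, not taken from the author's dataset.
sample = [
    "sports@@@@the team won the match\n",
    "politics@@@@the vote was held today\n",
    "sports@@@@a late goal decided the match\n",
]

docs = parse_labeled_lines(sample)

# Number of documents per class: one line is one document, so counting
# labels is the same as counting documents.
count_of_that_class = Counter(label for label, _ in docs)
```

Because every line carries its own label, classes and documents never need to be stored in separate files: the label says which class a document belongs to, and the line break says where the document ends.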

On Wed, Nov 14, 2018 at 8:01 PM saria Goudarzvand wrote: Thanks for replying. It would be nice if you could share a sample of your dataset. So you mean cls, sep, text = line.partition('@@@@@@@@@@'): you have a file in which your documents are separated by @@@? What about your classes? Say you have five classes with 1000 documents in each. How did you differentiate classes from documents in your source data?

-- Ali Mortazavi BSc graduated in Computer Engineering | Amirkabir University of Technology (Tehran Polytechnic) http://ceit.aut.ac.ir/~mortazavi, ali_mortazavi@aut.ac.ir

Answer 5 · 2018-11-16T15:48:26.000Z

(count_of_that_class[i]-tmp[i]) is the number of documents in class i that do not contain the word w[j].
(number_of_docs-word_occurance_frequency) is the number of documents in which w[j] does not occur.
Dividing the first by the second gives the probability of class[i] within the set of documents that do not contain w[j].
Please note that j does not appear in the code because it is not needed: NumPy arrays handle all the words at once through vectorization. For instance, tmp is a 2-d array and tmp[i] is a 1-d array.
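To make the hidden index j visible, the same quantity can be written with explicit loops in plain Python. This is a sketch under assumptions, not the repository's code: it assumes tmp[i][j] holds the number of documents in class i that contain word j, and that word_occurance_frequency[j] is the number of documents (in any class) that contain word j. The function name, the toy numbers, and the guard against division by zero are additions for illustration.

```python
def class_given_word_absent(tmp, count_of_that_class, number_of_docs,
                            word_occurance_frequency):
    """Return p[i][j] = P(class i | document does not contain word j)."""
    p = []
    for i, class_size in enumerate(count_of_that_class):
        row = []
        for j, docs_with_word in enumerate(word_occurance_frequency):
            # Documents of class i that do NOT contain word j.
            docs_in_class_without_word = class_size - tmp[i][j]
            # Documents of any class that do NOT contain word j.
            docs_without_word = number_of_docs - docs_with_word
            if docs_without_word == 0:
                # Added guard (assumption): the word occurs in every document,
                # so the conditional probability is undefined; use 0.0 here.
                row.append(0.0)
            else:
                row.append(docs_in_class_without_word / docs_without_word)
        p.append(row)
    return p


# Toy example: 2 classes, 3 words, 10 documents in total.
tmp = [
    [3, 0, 4],  # class 0: documents containing word 0, 1, 2
    [1, 2, 6],  # class 1: documents containing word 0, 1, 2
]
count_of_that_class = [4, 6]
number_of_docs = 10
word_occurance_frequency = [4, 2, 10]  # column sums of tmp

p_class_condition_on_not_w = class_given_word_absent(
    tmp, count_of_that_class, number_of_docs, word_occurance_frequency
)
```

With NumPy arrays, the inner loop over j collapses into the single vectorized line quoted in the question, which is why only i shows up there. The document counts matter because the probability is taken over documents: the denominator is the number of documents that lack the word, not a count of word tokens.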
