rcv1.ipynb does not match its description

Question

cloud-waiting-for-wind opened this issue 7 years ago · 1 comments

cloud-waiting-for-wind commented 7 years ago

A comment in rcv1.ipynb says:
"This left us with 50 classes and 402,738 documents. We divided the documents into equal-sized training and test sets randomly. Each document was represented using the 2000 most frequent non-stopwords in the dataset."

However, dataset.data_info() reports:
N = 420065 documents, M = 47236 words, sparsity=0.0000%

According to the comment in the notebook (rcv1.ipynb), N should be 402738 and M should be 2000, but the actual values do not match that description.
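For reference, the preprocessing that the quoted comment describes (keep only the most frequent words, then split the documents randomly into two equal halves) could be sketched roughly as below. This is a minimal standard-library illustration, not the notebook's code: the function names `keep_top_words` and `split_in_half` and the dict-per-document format are hypothetical, and stopword removal is assumed to have been done already.

```python
import random
from collections import Counter


def keep_top_words(docs, k):
    """Keep only the k words with the highest total count across all docs.

    Each document is a dict mapping word -> count (hypothetical format).
    Returns the filtered documents and the retained vocabulary.
    """
    totals = Counter()
    for doc in docs:
        totals.update(doc)  # adds each word's count to the running total
    vocab = {word for word, _ in totals.most_common(k)}
    reduced = [{w: c for w, c in doc.items() if w in vocab} for doc in docs]
    return reduced, vocab


def split_in_half(docs, seed=0):
    """Shuffle document indices and split them into two equal-sized halves."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train = [docs[i] for i in idx[:half]]
    test = [docs[i] for i in idx[half:2 * half]]  # drops one doc if the count is odd
    return train, test


# Toy corpus standing in for the real dataset.
docs = [
    {"oil": 3, "price": 1},
    {"oil": 1, "bank": 2},
    {"bank": 1, "rate": 1},
    {"price": 2, "rate": 1, "merger": 1},
]
reduced, vocab = keep_top_words(docs, k=3)
train, test = split_in_half(reduced)
print(len(vocab), len(train), len(test))
```

With k=2000 on the real data, a step like this would give M = 2000, so the reported M = 47236 would be consistent with no such filtering having been applied.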

Is there something wrong in rcv1.ipynb? Also, the repository contains code for RCV1, but the paper does not mention the RCV1 dataset. Why?

Answer 1 · 2017-08-09T10:33:42.000Z

Thanks for your interest. RCV1 is not included in the paper, and the code and comments are half-finished, because it was a work in progress. I started working on it because its corpus is larger than 20news, but I could not finish in time. Sorry for the confusion.