rcv1.ipynb does not match its description
cloud-waiting-for-wind opened this issue · 1 comment
A comment in rcv1.ipynb says:
"This left us with 50 classes and 402,738 documents. We divided the documents into equal-sized training and test sets randomly. Each document was represented using the 2000 most frequent non-stopwords in the dataset."
But dataset.data_info() reports:
N = 420065 documents, M = 47236 words, sparsity=0.0000%
According to the comment in the notebook (rcv1.ipynb), N should be 402738 and M should be 2000, but the actual values do not match that description.
Is there something wrong in rcv1.ipynb? Also, the repository includes rcv1 code, but the paper does not mention the rcv1 dataset. Why?
Thanks for your interest. rcv1 is not in the paper, and the code and comments are half-finished, because it was a work in progress. I started working on it because its corpus is larger than 20news, but could not finish in time. Sorry for the confusion.
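For anyone who wants to reproduce the preprocessing that the notebook comment describes (keep only the most frequent non-stopwords, then split the documents randomly into equal-sized training and test sets), here is a minimal sketch in plain Python. It is not the notebook's code: the function names `top_k_vocabulary`, `to_count_vectors`, and `split_in_half`, the toy corpus, and the stopword set are all assumptions made for illustration. On the real data one would use k=2000.

```python
import random
from collections import Counter


def top_k_vocabulary(docs, k, stopwords=frozenset()):
    """Return the k most frequent non-stopword tokens across all documents."""
    counts = Counter(tok for doc in docs for tok in doc if tok not in stopwords)
    return [tok for tok, _ in counts.most_common(k)]


def to_count_vectors(docs, vocab):
    """Represent each document as term counts over the chosen vocabulary."""
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok in doc:
            i = index.get(tok)
            if i is not None:  # tokens outside the vocabulary are dropped
                vec[i] += 1
        vectors.append(vec)
    return vectors


def split_in_half(items, seed=0):
    """Shuffle with a fixed seed and split into two equal-sized halves
    (the second half gets the extra item when the count is odd)."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    half = len(items) // 2
    train = [items[i] for i in order[:half]]
    test = [items[i] for i in order[half:]]
    return train, test


# Hypothetical toy corpus standing in for the tokenized rcv1 documents.
docs = [
    ["the", "market", "rose", "today"],
    ["the", "market", "fell"],
    ["oil", "prices", "rose"],
    ["the", "oil", "market"],
]
vocab = top_k_vocabulary(docs, k=3, stopwords={"the"})
vectors = to_count_vectors(docs, vocab)
train, test = split_in_half(vectors)
print(f"N = {len(vectors)} documents, M = {len(vocab)} words")
```

After a step like this, M equals the chosen k rather than the full vocabulary size, which is the check the question above is making against dataset.data_info().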