ng-vocab.tsv file not found
Closed this issue · 2 comments
Hi,
I am running this notebook: https://github.com/sujitpal/eeap-examples/blob/master/src/04c-ng-clf-eeap.ipynb
When I run the "Load Vocabulary" cell, it raises this error:
<ipython-input-4-e0b7b98e6728> in <module>()
      1 word2id = {"PAD": 0, "UNK": 1}
----> 2 fvocab = open(VOCAB_FILE, "rb")
      3 for i, line in enumerate(fvocab):
      4     word, count = line.strip().split("\t")
      5     if int(count) <= MIN_OCCURS:

FileNotFoundError: [Errno 2] No such file or directory: '../data/ng-vocab.tsv'
I fixed this issue by getting the frequency of the words from the corpus:
    import codecs

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    DATA_DIR = "../data"
    VOCAB_SIZE = 40000

    ng_data = fetch_20newsgroups(subset="all", data_home=DATA_DIR, shuffle=True, random_state=42)
    count_vect = CountVectorizer(max_features=VOCAB_SIZE)
    X_train_counts = count_vect.fit_transform(ng_data.data)

    # vocabulary_ maps each word to its column index, not its frequency,
    # so sum each column of the document-term matrix to get corpus counts.
    word_counts = np.asarray(X_train_counts.sum(axis=0)).ravel()

    with codecs.open('../data/ng-vocab.tsv', 'w', encoding='utf-8') as fw:
        for word, idx in count_vect.vocabulary_.items():
            fw.write('{}\t{}\n'.format(word, word_counts[idx]))
Note: This will write the file in text mode and not in binary. I changed the code in the notebooks to read the file in text mode ('rb' to 'r').
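For anyone making the same change, here is a minimal sketch of what the text-mode "Load Vocabulary" step could look like. In Python 3, a file opened with 'rb' yields bytes, so `line.strip().split("\t")` would raise a TypeError; opening with 'r' avoids that. The function name `load_vocab`, the default `min_occurs=5`, and the choice to skip rare words are my assumptions, since the traceback only shows the first lines of the cell.

```python
def load_vocab(path, min_occurs=5):
    """Build a word -> id map from a tab-separated `word<TAB>count` file.

    Hypothetical helper: the name, the default threshold, and the handling
    of rare words are assumptions, not taken from the notebook.
    """
    word2id = {"PAD": 0, "UNK": 1}
    # Text mode with an explicit encoding yields str lines, so split("\t") works.
    with open(path, "r", encoding="utf-8") as fvocab:
        for line in fvocab:
            word, count = line.rstrip("\n").split("\t")
            if int(count) <= min_occurs:
                continue  # assumption: rare words are skipped and fall back to UNK
            word2id[word] = len(word2id)
    return word2id
```

Calling `load_vocab("../data/ng-vocab.tsv")` would then return the dictionary the later cells index into, assuming every line of the file has exactly one tab.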
@riteshpanjwani I tried what you suggested, but the code still doesn't reach the result that the author mentions in the code itself. What result did you get? (Please see the issue I opened for more detail.)