explosion/sense2vec

sense2vec: TypeError: write_json() missing 1 required positional argument: 'data'

Z-e-e opened this issue · 7 comments

Z-e-e commented

"I'm having some trouble in training my own vectors. I used the scripts posted above in the bin folder. The preprocess step worked fine, but I'm getting an error in the train step.

Here is the call I make to the train function:

train(in_dir = '/home/portnows/preprocessed', out_file = '/home/portnows/trained')

Traceback (most recent call last):
File "/home/portnows/tmp/Rtmp5o4CVd/chunk-code-bf0842242e06.txt", line 41, in
train(in_dir = '/home/portnows/preprocessed', out_file = '/home/portnows/trained')
File "/home/portnows/tmp/Rtmp5o4CVd/chunk-code-bf0842242e06.txt", line 37, in train
vector_map.save(out_file)
File "vectors.pyx", line 195, in sense2vec.vectors.VectorMap.save
TypeError: write_json() missing 1 required positional argument: 'data'

Any ideas what might be causing this? The documentation makes it seem that out_file should be a directory where json gets stored, but should it actually be a file?

Originally posted by @SamPortnow in #36 (comment)"
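One thing worth ruling out (a hedged sketch, not a confirmed fix for the `write_json()` error, which looks like an API mismatch inside the library itself): since the documentation suggests `out_file` is a directory, make sure the path exists as a directory before calling `save()` on it. The path below is the one from the traceback above.

```python
from pathlib import Path

# Illustrative check: ensure the output location exists as a directory
# before calling vector_map.save() on it. The path matches the call above.
out_dir = Path("/home/portnows/trained")
out_dir.mkdir(parents=True, exist_ok=True)  # no-op if it already exists
print(out_dir.is_dir())
```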

I am having the same issue. Has there been a solution? The links provided in the response from October 2019 are broken.

Are you using the training scripts or making calls directly to train? And are you using GloVe or fasttext? I just trained some using the scripts without that error.

Z-e-e commented

I am using sense2vec==1.0.0a1 and this script, which was provided in one of the previous issues and comments referenced above:

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences
from sense2vec.vectors import VectorMap

w2v_model = Word2Vec(
    size=size,
    window=window,
    min_count=min_count,
    workers=workers,
    sample=1e-5,
    negative=negative,
    iter=epochs,
)

sentences = PathLineSentences(path)

print("building the vocabulary...")
w2v_model.build_vocab(sentences)

print("training the model...")
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=w2v_model.iter)

print("creating the sense2vec model...")
vector_map = VectorMap(size)

# Copy each surviving term's frequency and vector into the VectorMap
for string in w2v_model.wv.vocab:
    vocab = w2v_model.wv.vocab[string]
    freq, idx = vocab.count, vocab.index
    if freq < min_count:
        continue
    vector = w2v_model.wv.vectors[idx]
    vector_map.borrow(string, freq, vector)

print("saving the model to file...")
vector_map.save(out_path)

It runs, but the freqs.json is never produced and the model is never finalized. I am using this script because the newer training scripts involve fasttext, which has proved very difficult to use on Windows and in Jupyter. I parsed and pre-processed the text using the newer scripts. I am new to all of this, so I understand that I may be missing something here.
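If the save step keeps failing before freqs.json is written, one stopgap (a hedged sketch, not the library's own method) is to dump the frequency counts yourself, since `vocab.count` is already available in the loop above. The exact on-disk format sense2vec expects is an assumption here, so compare against a freqs.json produced by the official scripts before relying on this.

```python
import json

# Hypothetical stand-in for w2v_model.wv.vocab. With a real model you
# would build `freqs` from the same loop as above:
#   freqs = {s: w2v_model.wv.vocab[s].count for s in w2v_model.wv.vocab}
freqs = {"duck|NOUN": 42, "duck|VERB": 7}

# Write the counts as plain JSON. Whether sense2vec wants exactly this
# shape is an assumption -- verify against the official scripts' output.
with open("freqs.json", "w", encoding="utf8") as f:
    json.dump(freqs, f)
```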

You can use either GloVe or fasttext with the new scripts (see step 4). I just trained vectors using GloVe. The codebase has changed a lot since v1, so debugging the old script might be difficult.

Z-e-e commented

That makes sense. I will give it a shot today, thanks.

Z-e-e wrote:

@ahalterman are you able to share this part with me, in terms of input:

fasttext_bin=("Path to the fasttext binary", "positional", None, str),
in_dir=("Directory with preprocessed .s2v files", "positional", None, str),
out_dir=("Path to output directory", "positional", None, str),
n_threads=("Number of threads", "option", "t", int),
min_count=("Minimum count for inclusion in vocab", "option", "c", int),
vector_size=("Dimension of word vector representations", "option", "s", int),
verbose=("Set verbosity: 0, 1, or 2", "option", "v", int),
)

Are you trying to use GloVe or fasttext? Here was the call I did for this step using GloVe:

python sense2vec/scripts/04_glove_train_vectors.py glove/build glove_counts/cooccurrence.shuf.bin glove_counts/vocab.txt vectors/

Z-e-e commented

@ahalterman I am using fasttext.

I think you should try the fasttext-specific training script in step 4 and see how it goes! I've only used GloVe, so I don't know any specifics about training fasttext.