codekansas/keras-language-modeling

Bootstrapping issue: No clear path to reproduce results

jim-kukla opened this issue · 7 comments

Thanks for sharing this experiment.

I'm trying to get it working to reproduce your results, but it seems like there's a bootstrapping problem.

  • Running either script produces an error that some required resource doesn't exist in models/.
  • Assuming insurance_qa_eval.py is the top-level script, I've uncommented the "save embeddings" portion of the script, but I'm still waiting for it to finish running.
  • It looks like I'll then need to run the __main__ block in insurance_qa_embeddings.py to produce models/word2vec_100_dim.h5 and finish bootstrapping.
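
To make the missing-resource problem easier to pin down, here is a minimal sketch of a pre-flight check that lists which files are absent from models/ before either script runs. The helper name missing_resources and the REQUIRED list are hypothetical; only word2vec_100_dim.h5 is actually named in this thread, so the list would need to be extended with whatever else the scripts expect.

```python
from pathlib import Path

# Hypothetical list of required files; only word2vec_100_dim.h5 is named in this thread.
REQUIRED = ["word2vec_100_dim.h5"]


def missing_resources(models_dir, required=REQUIRED):
    """Return the required file names that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_resources("models")
    if missing:
        print("Missing from models/:", ", ".join(missing))
    else:
        print("All required resources found.")
```

Running this from the repository root before the eval script would at least show exactly which file the error refers to.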

Is that the right approach for getting this running? If so, I'll open a PR when I've got it all working.

Did you download the dataset from here? I'm not sure which resource it could be. Could you paste the error message?

Yep, word2vec_100_dim.h5 was produced by merging the output of Gensim's Word2Vec model with the result of training a 100-dimensional EmbeddingModel. I haven't formalized this yet; mostly I've been trying out different word embeddings to see what works. Once something works well, I'll put the weight file on GitHub for general use.

I'd appreciate it if you wanted to open a PR for a stand-alone script. Let me know if you have more questions.

I went ahead and added the word embeddings I've been using to GitHub.

Which word2vec output are you referring to here?
When we save a gensim word2vec model, we typically get the following files:
outfilename, outfilename.syn1neg, outfilename.syn0.np, outfilename.syn1.np

Which one corresponds to the ".h5" file you mentioned above, or to the word2vec_100_dim.embeddings file you uploaded?
Also, I couldn't find "word_embeddings.py", where you might have written something related to this.

I believe syn0 is the equivalent of the Keras embedding layer; that's what I've been using. It really comes down to these lines:

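import numpy as np
# Load the saved embedding matrix and copy it into the embedding layer (`model` is assumed to be already built).
weights = np.load('word2vec_100_dim.embeddings')
language_model = model.prediction_model.layers[2]
language_model.layers[2].set_weights([weights])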
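
For anyone wanting to build a similar file from their own vectors, here is a rough sketch of one way to do it, not necessarily how word2vec_100_dim.embeddings was made. It assumes a plain dict of word vectors (a stand-in for whatever trained vectors you have), a hypothetical word_to_index mapping with contiguous indices starting at 0, and zero rows for words without a vector. A Keras Embedding layer holds a single weight array of shape (vocab_size, dim), which is why the array is passed as a one-element list to set_weights.

```python
import io

import numpy as np


def build_embedding_matrix(vectors, word_to_index, dim):
    """Stack word vectors into a (vocab_size, dim) matrix ordered by vocabulary index.

    vectors: mapping word -> 1-D array of length dim (stand-in for trained vectors).
    word_to_index: mapping word -> row index, assumed contiguous from 0.
    Words without a vector keep an all-zero row.
    """
    matrix = np.zeros((len(word_to_index), dim), dtype=np.float32)
    for word, idx in word_to_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Toy data, purely illustrative (real vectors would have 100 dimensions here).
vectors = {"policy": np.ones(4), "claim": np.full(4, 2.0)}
word_to_index = {"<pad>": 0, "policy": 1, "claim": 2}
weights = build_embedding_matrix(vectors, word_to_index, dim=4)

# Round-trip through np.save / np.load. Writing to a file object (rather than
# a path string) means NumPy does not append a .npy suffix to the name.
buffer = io.BytesIO()
np.save(buffer, weights)
buffer.seek(0)
restored = np.load(buffer)
print(restored.shape)
```

The row order has to match the vocabulary indices the model was trained with; otherwise the loaded weights will be assigned to the wrong words.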

@codekansas Yes, I did have the insurance_qa_python repo cloned and had all the data_paths set properly.

Thanks for adding those .h5 entries. I'll take a look shortly and let you know if everything's working for me.

Hi, do you mean outfilename.syn0.np = word2vec_100_dim.embeddings?

It might be different depending on your version.