Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
Currently, the setup.py and the pip install are both not working! Unfortunately, I suggest you unpack the files yourself, for now. I am actively looking for help fixing that problem!
The preprocessing is all done through the "nlppipe.py" file. Using SpaCy, we have added a lot of functionality. We can pad/cut off our sentences, merge noun phrases, use parallel processing, and load pretrained vectors.
At the most basic level, if you would like to get your data processed for lda2vec, you can do the following:
data_dir = "data"
run_name = "my_run"
# Python list of your text
texts = ["list of your text here", ..., "your text here"]
# Run preprocessing, limiting/padding documents to 100 tokens
utils.run_preprocessing(texts, data_dir, run_name, max_length=100, vectors="en_core_web_sm")
When you run the twenty newsgroups example, it will create a directory tree that looks like this:
├── my_project
│ ├── data
│ │ ├── 20_newsgroups.txt
│ │ └── my_run
│ │ ├── doc_lengths.npy
│ │ ├── embed_matrix.npy
│ │ ├── freqs.npy
│ │ ├── idx_to_word.pickle
│ │ ├── skipgrams.txt
│ │ └── word_to_idx.pickle
│ ├── load_20newsgroups.py
│ └── run_20newsgroups.py
To run the model, pass the same data_path and run_name to the load_preprocessed_data function and then use that data to instantiate and train the model.
data_dir = "data"
run_name = "my_run"
num_topics = 20
num_epochs = 20
# Load preprocessed data
idx_to_word, word_to_idx, freqs, embed_matrix, pivot_ids,
target_ids, doc_ids, num_docs, vocab_size, embed_size) = utils.load_preprocessed_data(data_dir, run_name)
# Instantiate the model
m = model(num_docs,
vocab_size,
num_topics = num_topics,
embedding_size = embed_size,
load_embeds=True,
pretrained_embeddings=embed_matrix,
freqs = freqs)
# Train the model
m.train(pivot_ids,target_ids,doc_ids, len(pivot_ids), num_epochs, idx_to_word = idx_to_word, switch_loss_epoch=5)
We can now visualize the results of our model using pyLDAvis:
utils.generate_ldavis_data(data_path, run_name, m, idx_to_word, freqs, vocab_size)
This will launch pyLDAvis in your browser, allowing you to visualize your results like this: