GateNLP/gateplugin-LearningFramework

Mallet LDA improvements/changes

johann-petrak opened this issue · 6 comments

A bunch of improvements, changes and checks to do, collected into this single issue:

  1. make sure we save the right data for later topic inference:
    • model file: apparently just for resuming training
    • inference file: for application on new documents
    • Gibbs state: why/when exactly do we need this?
  2. make sure we use hyperparameter optimization
  3. check what takes so long: is it calculating the diagnostics? If so, make saving them optional
  4. find per-topic word probabilities without running the diagnostics
  5. save separate files after training:
    • words and their weights per topic (ntopics*nwords lines with topicnr, word, weight)
    • topic importance (ntopics lines with topicnr, topic weight)
    • if we run inference: documents and their topic weights (ndocuments*ntopics lines with docname, topicnr, weight), optional?
    • the full diagnostics file, optionally
  6. save separate files after application:
    • documents and their topic weights (as for training), optional?
  7. add a parameter that specifies a feature-name prefix for storing global info in every document's document features (not the document annotation!)
  8. provide a Groovy script for deriving k-best topic index lists from the topic distribution, based on:
    • which topics have the highest probability
    • whether the probability exceeds some threshold
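To make the file layout proposed in item 5 concrete, here is a minimal sketch of a writer for the ntopics*nwords word-weight lines. The class and method names are hypothetical, and the word/weight arrays stand in for whatever the trained model provides:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class TopicWordWeightsFile {

  // Write one "topicnr,word,weight" line per (topic, word) pair,
  // matching the proposed ntopics*nwords file layout.
  // words[t][w] and weights[t][w] are assumed to hold the top words
  // per topic and their weights, however they were obtained.
  static void write(PrintWriter out, String[][] words, double[][] weights) {
    for (int t = 0; t < words.length; t++) {
      for (int w = 0; w < words[t].length; w++) {
        out.printf("%d,%s,%f%n", t, words[t][w], weights[t][w]);
      }
    }
  }

  public static void main(String[] args) {
    StringWriter sw = new StringWriter();
    PrintWriter pw = new PrintWriter(sw);
    write(pw, new String[][] {{"gene", "cell"}, {"court", "law"}},
              new double[][] {{0.12, 0.08}, {0.20, 0.10}});
    pw.flush();
    System.out.print(sw);
  }
}
```

The topic-importance file (ntopics lines of topicnr, weight) and the per-document topic-distribution file would follow the same pattern with different columns.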

Ad 1.:

  • we save the whole model, but for assigning a topic distribution at apply time we only really need to save a TopicInferencer instance. However, the model contains that instance, so we just save a lot more than needed. Let's keep that for now.
  • the state file is a gzipped text representation but this is also saved as part of the model
  • NOTE: instances are also saved as part of the model, so eventually we may want to avoid saving the model?
  • but so far, what we do seems to be fine
Ad 2.: make sure we use hyperparameter optimization:

  • this should always happen anyway: optimizeInterval is set to 50 in ParallelTopicModel
  • however, when testing with the Mallet command-line tool train-topics, specifying --optimize-interval 50 yields interesting topic weights, while omitting the parameter gives equal weights and looks identical to --optimize-interval 0
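For reference, the observation above can be reproduced with the stock Mallet command line; the input and output file names below are placeholders:

```sh
# Train with hyperparameter optimization every 50 iterations;
# omitting --optimize-interval behaves like --optimize-interval 0
# (all topics end up with equal weights).
bin/mallet train-topics \
  --input corpus.mallet \
  --num-topics 20 \
  --optimize-interval 50 \
  --output-model lda.model \
  --output-state lda-state.gz \
  --inferencer-filename lda.inferencer \
  --output-topic-keys topic-keys.txt
```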
Ad 3.: check what takes so long:

  • yes, it is calculating the topic model diagnostics; make calculating and storing the diagnostics file optional, default off

Ad 4.: has been implemented.

Ad 5.: has been implemented.

Still missing:

    1. save separate files after application (topic distribution per document, as for training)
    2. parameter for storing per-document features (maybe not implement this after all?)
    3. prepared Groovy script or PR for assigning topics (have something in a private DCMSC repository which could be adapted)
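The k-best selection described in item 8 (still missing as point 3) could look like the following sketch; a Groovy version would be nearly identical, and k and the threshold values are illustrative, not the plugin's defaults:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KBestTopics {

  // Return the indices of the k highest-probability topics,
  // keeping only those whose probability exceeds the threshold.
  static List<Integer> kBest(double[] dist, int k, double threshold) {
    List<Integer> idx = new ArrayList<>();
    for (int i = 0; i < dist.length; i++) idx.add(i);
    // Sort topic indices by descending probability.
    idx.sort(Comparator.comparingDouble((Integer i) -> dist[i]).reversed());
    List<Integer> result = new ArrayList<>();
    for (Integer i : idx) {
      if (result.size() >= k) break;
      if (dist[i] > threshold) result.add(i);
    }
    return result;
  }

  public static void main(String[] args) {
    double[] dist = {0.05, 0.40, 0.10, 0.45};
    // Top 2 topics with probability > 0.2: topics 3 and 1.
    System.out.println(kBest(dist, 2, 0.2)); // prints [3, 1]
  }
}
```

Since the distribution is sorted descending, stopping at the first topic below the threshold would give the same result as the explicit filter used here.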