Mallet LDA improvements/changes
johann-petrak opened this issue · 6 comments
johann-petrak commented
A bunch of improvements, changes and checks to do, collected into this single issue:
- make sure we save the right data for later topic inference:
  - model file: apparently just for resuming training
  - inference file: for application on new documents
  - Gibbs state: why/when exactly do we need this?
- make sure we use hyperparameter optimization
- check what takes so long: is it calculating the diagnostics? If yes, make saving that optional
- find word-per-topic probabilities without running the diagnostics
- save separate files after training:
  - words and their weights per topic (ntopics*nwords lines with topicnr, word, weight)
  - topic importance (ntopics lines with topicnr, topic weight)
  - if we run inference: documents and their topic weights (ndocuments*ntopics lines with docname, topicnr, weight), optional?
  - the full diagnostics file, optionally
- save separate files after application:
  - documents and their topic weights (as for training), optional?
- parameter that specifies a feature name prefix for storing global info in every document's document features (not the document annotation!)
- provide a Groovy script for deriving k-best topic index lists from the topic distribution based on:
  - which topics have the highest probability
  - whether the probability is greater than some threshold
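
The k-best selection described in the last item could look roughly like the following. This is only a sketch in plain Java (it should port to Groovy almost unchanged); the class name `KBestTopics`, the method `kBest` and the parameters `k` and `minProb` are made up for illustration and are not part of the plugin or of Mallet. It assumes the topic distribution is available as a `double[]` indexed by topic number.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Mallet or the plugin.
class KBestTopics {
    /**
     * Returns the indices of at most k topics, ordered by descending
     * probability, keeping only topics whose probability is greater
     * than minProb.
     */
    static List<Integer> kBest(double[] topicDist, int k, double minProb) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < topicDist.length; i++) {
            if (topicDist[i] > minProb) {
                indices.add(i);
            }
        }
        // Highest probability first. List.sort is stable, so topics with
        // equal probability stay in ascending index order.
        indices.sort(Comparator.comparingDouble((Integer i) -> -topicDist[i]));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    public static void main(String[] args) {
        double[] dist = {0.05, 0.40, 0.10, 0.30, 0.15};
        System.out.println(kBest(dist, 3, 0.12)); // prints [1, 3, 4]
    }
}
```

Setting `minProb` to 0 gives a pure top-k selection, and setting `k` to the number of topics gives a pure threshold filter, so both criteria from the list are covered by one method.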
johann-petrak commented
Ad 1.:
- we save the whole model, but for assigning a topic distribution at apply time we only really need to save a TopicInferencer instance. However, the model contains that instance, so we just save a lot more than needed. Let's keep that for now.
- the state file is a gzipped text representation, but this is also saved as part of the model
- NOTE: instances are also saved as part of the model, so eventually we may want to avoid saving the model?
- but so far, what we do seems to be fine
johann-petrak commented
- make sure we use hyperparameter optimization:
  - this should always happen anyway: optimizeInterval is set to 50 in ParallelTopicModel
  - however, when testing with the Mallet command line tool train-topics, specifying `--optimize-interval 50` gives interesting topic weights, while omitting that parameter gives equal weights and looks identical to `--optimize-interval 0`
johann-petrak commented
- check what takes so long
  - yes, it is calculating the topic model diagnostics; make calculating and storing the file optional, off by default
johann-petrak commented
- Has been implemented
johann-petrak commented
5 has been implemented
johann-petrak commented
Still missing:
- save separate files after application (topic distribution per document, as for training)
- parameter for storing per-document features (maybe not implement this after all?)
- prepared Groovy script or PR for assigning topics (have something in a private DCMSC repository which could be adapted)
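
For the first missing item, the per-document output with ndocuments*ntopics lines of docname, topicnr, weight could be produced along these lines. This is a sketch under assumptions: the class `DocTopicLines`, the method `toLines`, the tab separator and the six-decimal formatting are all chosen here for illustration, and the distributions are assumed to be already available as one `double[]` per document name.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Mallet or the plugin.
class DocTopicLines {
    /** Builds one tab-separated line per (document, topic) pair. */
    static String toLines(Map<String, double[]> docTopics) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, double[]> e : docTopics.entrySet()) {
            double[] weights = e.getValue();
            for (int t = 0; t < weights.length; t++) {
                // Locale.ROOT keeps the decimal separator a dot on every machine.
                sb.append(String.format(Locale.ROOT, "%s\t%d\t%.6f\n",
                        e.getKey(), t, weights[t]));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so documents are
        // written in the order they were processed.
        Map<String, double[]> docTopics = new LinkedHashMap<>();
        docTopics.put("doc1.txt", new double[] {0.7, 0.3});
        docTopics.put("doc2.txt", new double[] {0.2, 0.8});
        System.out.print(toLines(docTopics));
    }
}
```

The same line layout would work for the training-time file, so one writer could serve both cases.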