GateNLP/gateplugin-LearningFramework

Mallet LDA improvements/changes

johann-petrak opened this issue · 6 comments

A bunch of improvements, changes and checks to do, collected into this single issue:

  1. make sure we save the right data for later topic inference:
    • model file: apparently just for resuming training
    • inference file: for application on new documents
    • Gibbs state: why/when exactly do we need this?
  2. make sure we use hyperparameter optimization
  3. check what takes so long: is it calculating the diagnostics? If so, make saving them optional
  4. find per-topic word probabilities without running the diagnostics
  5. save separate files after training:
    • words and their weights per topic (ntopics*nwords lines with topicnr, word, weight)
    • topic importance (ntopics lines with topicnr, topic weight)
    • if we run inference: documents and their topic weights (ndocuments*ntopics lines with docname, topicnr, weight), optional?
    • the full diagnostics file, optionally
  6. save separate files after application:
    • documents and their topic weights (as for training), optional?
  7. add a parameter that specifies a feature-name prefix for storing global info in every document's document features (not the document annotation!)
  8. provide a Groovy script for deriving k-best topic index lists from the topic distribution, based on:
    • which topics have the highest probability
    • whether the probability exceeds some threshold
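To make the file layout proposed in item 5 concrete, here is a minimal sketch of a writer for the ntopics*nwords word-weight lines. The class and method names are hypothetical, and the word/weight arrays stand in for whatever the trained model provides:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class TopicWordWeightsFile {

  // Write one "topicnr,word,weight" line per (topic, word) pair,
  // matching the proposed ntopics*nwords file layout.
  // words[t][w] and weights[t][w] are assumed to hold the top words
  // per topic and their weights, however they were obtained.
  static void write(PrintWriter out, String[][] words, double[][] weights) {
    for (int t = 0; t < words.length; t++) {
      for (int w = 0; w < words[t].length; w++) {
        out.printf("%d,%s,%f%n", t, words[t][w], weights[t][w]);
      }
    }
  }

  public static void main(String[] args) {
    StringWriter sw = new StringWriter();
    PrintWriter pw = new PrintWriter(sw);
    write(pw, new String[][] {{"gene", "cell"}, {"court", "law"}},
              new double[][] {{0.12, 0.08}, {0.20, 0.10}});
    pw.flush();
    System.out.print(sw);
  }
}
```

The topic-importance file (ntopics lines of topicnr, weight) and the per-document topic-distribution file would follow the same pattern with different columns.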

Ad 1.:

  • we save the whole model, but for assigning a topic distribution at apply time we only really need to save a TopicInferencer instance. However, the model contains that instance, so we just save a lot more than needed. Let's keep that for now.
  • the state file is a gzipped text representation but this is also saved as part of the model
  • NOTE: instances are also saved as part of the model, so eventually we may want to avoid saving the model?
  • but so far, what we do seems to be fine
Ad 2.: make sure we use hyperparameter optimization:

  • this should always happen anyway: optimizeInterval is set to 50 in ParallelTopicModel
  • however, when testing with the Mallet command-line tool train-topics, specifying --optimize-interval 50 yields interesting topic weights, while omitting the parameter gives equal weights and looks identical to --optimize-interval 0
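For reference, the observation above can be reproduced with the stock Mallet command line; the input and output file names below are placeholders:

```sh
# Train with hyperparameter optimization every 50 iterations;
# omitting --optimize-interval behaves like --optimize-interval 0
# (all topics end up with equal weights).
bin/mallet train-topics \
  --input corpus.mallet \
  --num-topics 20 \
  --optimize-interval 50 \
  --output-model lda.model \
  --output-state lda-state.gz \
  --inferencer-filename lda.inferencer \
  --output-topic-keys topic-keys.txt
```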
Ad 3.: check what takes so long:

  • yes, it is calculating the topic model diagnostics; make calculating and storing the diagnostics file optional, default off

Ad 4.: has been implemented.

Ad 5.: has been implemented.

Still missing:

    1. save separate files after application (topic distribution per document, as for training)
    2. parameter for storing per-document features (maybe not implement this after all?)
    3. prepared Groovy script or PR for assigning topics (have something in a private DCMSC repository which could be adapted)
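The k-best selection described in item 8 (still missing as point 3) could look like the following sketch; a Groovy version would be nearly identical, and k and the threshold values are illustrative, not the plugin's defaults:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KBestTopics {

  // Return the indices of the k highest-probability topics,
  // keeping only those whose probability exceeds the threshold.
  static List<Integer> kBest(double[] dist, int k, double threshold) {
    List<Integer> idx = new ArrayList<>();
    for (int i = 0; i < dist.length; i++) idx.add(i);
    // Sort topic indices by descending probability.
    idx.sort(Comparator.comparingDouble((Integer i) -> dist[i]).reversed());
    List<Integer> result = new ArrayList<>();
    for (Integer i : idx) {
      if (result.size() >= k) break;
      if (dist[i] > threshold) result.add(i);
    }
    return result;
  }

  public static void main(String[] args) {
    double[] dist = {0.05, 0.40, 0.10, 0.45};
    // Top 2 topics with probability > 0.2: topics 3 and 1.
    System.out.println(kBest(dist, 2, 0.2)); // prints [3, 1]
  }
}
```

Since the distribution is sorted descending, stopping at the first topic below the threshold would give the same result as the explicit filter used here.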