unicode-org/lstm_word_segmentation

pick_lstm_model parameters are too complicated to call

FrankYFTang opened this issue · 9 comments

I have the following simple program to try out all of the different models under

https://github.com/unicode-org/lstm_word_segmentation/tree/master/Models

It currently works for Thai_codepoints_exclusive_model4_heavy, but I'm having trouble figuring out which values need to be passed in for the other models.

# Lint as: python3
from lstm_word_segmentation.word_segmenter import pick_lstm_model
import sys, getopt

"""
Read a file and output segmented results.
"""

def main(argv):
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('test.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    print('Input file is', inputfile)
    print('Output file is', outputfile)

    with open(inputfile, 'r') as file1:
        lines = file1.readlines()

    word_segmenter = pick_lstm_model(model_name="Thai_codepoints_exclusive_model4_heavy",
                                     embedding="codepoints",
                                     train_data="exclusive BEST",
                                     eval_data="exclusive BEST")

    # Strip the newline character and segment each line
    for line in lines:
        line = line.strip()
        print(line)
        print(word_segmenter.segment_arbitrary_line(line))

if __name__ == "__main__":
    main(sys.argv[1:])

Could you specify what values should be used for embedding, train_data, and eval_data for the other models?

Burmese_codepoints_exclusive_model4_heavy
Burmese_codepoints_exclusive_model5_heavy
Burmese_codepoints_exclusive_model7_heavy
Burmese_genvec1235_model4_heavy
Burmese_graphclust_model4_heavy
Burmese_graphclust_model5_heavy
Burmese_graphclust_model7_heavy
Thai_codepoints_exclusive_model4_heavy
Thai_codepoints_exclusive_model5_heavy
Thai_codepoints_exclusive_model7_heavy
Thai_genvec123_model5_heavy
Thai_graphclust_model4_heavy
Thai_graphclust_model5_heavy
Thai_graphclust_model7_heavy

Or is there a simpler way, e.g. a function

get_lstm_model(model_name)

built on top of pick_lstm_model() that just fills in the necessary parameters to call pick_lstm_model()?

In this document, under "input_name", I explain the relationship between the name of a model and its hyperparameters. For pick_lstm_model it's actually much simpler: embedding should be the embedding that appears in the name of the model, e.g. if the model name contains codepoints you need embedding="codepoints", and if it contains graphclust you need embedding="grapheme_clusters_tf". The choice of train_data and eval_data shouldn't matter if you are only segmenting arbitrary lines (by calling the segment_arbitrary_line function), which is what I see in your code. However, if you want to train and evaluate using the BEST data or the my.txt file, you need to set train_data and eval_data to the appropriate values explained in the link above.
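For example, a sketch based on the mapping above: the train_data/eval_data strings here are assumptions taken from the linked document and, as noted, shouldn't matter when only calling segment_arbitrary_line.

from lstm_word_segmentation.word_segmenter import pick_lstm_model

word_segmenter = pick_lstm_model(model_name="Thai_graphclust_model4_heavy",
                                 embedding="grapheme_clusters_tf",  # "graphclust" appears in the model name
                                 train_data="BEST",                 # assumed; not important for segment_arbitrary_line
                                 eval_data="BEST")                  # assumed; not important for segment_arbitrary_line
print(word_segmenter.segment_arbitrary_line("สวัสดีครับ"))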

In fact, it is possible to get rid of the embedding parameter of pick_lstm_model if it is guaranteed that any model trained in the future follows the naming convention explained in this link, but I left it there because I wasn't sure that would be the case.
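A minimal sketch of such a wrapper, assuming models keep following that naming convention (get_lstm_model below is hypothetical and not part of the repo; it only infers embedding from the model name and forwards everything else to pick_lstm_model):

from lstm_word_segmentation.word_segmenter import pick_lstm_model

def get_lstm_model(model_name, train_data="BEST", eval_data="BEST"):
    # Infer the embedding from the model name, per the naming convention:
    # "codepoints" -> embedding="codepoints", "graphclust" -> embedding="grapheme_clusters_tf".
    # The genvec models use a generalized-vectors embedding whose exact string is
    # documented in the repo, so they are not handled in this sketch.
    if "codepoints" in model_name:
        embedding = "codepoints"
    elif "graphclust" in model_name:
        embedding = "grapheme_clusters_tf"
    else:
        raise ValueError("Cannot infer embedding from model name: " + model_name)
    return pick_lstm_model(model_name=model_name, embedding=embedding,
                           train_data=train_data, eval_data=eval_data)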

Yes, apparently an old version of the dictionaries was on our shared Google drive and I didn't notice it. Sorry if it wasted some of your time. I have updated the ratio files on our drive. I checked the updated file "Thai_graphclust_ratio.npy" and it seems to give the same numbers that you mentioned above.

So the Python code that you ran was flawed (and I guess you got lower accuracy there), but whatever we had in the JSON files was up to date.

@sffc this should not affect our model performance in Rust, that's probably why we didn't spot it sooner.

I made a commit that does this and left a comment for you there. I forgot to submit a PR, but I basically just changed the files and the lines of code that read/write the dictionaries. Please see my commit.

I also updated our Google drive accordingly.