pick_lstm_model parameters are too complicated to call
FrankYFTang opened this issue · 9 comments
I have the following simple program to see how to run all the different models under
https://github.com/unicode-org/lstm_word_segmentation/tree/master/Models
It currently works for `Thai_codepoints_exclusive_model4_heavy`, but I am having trouble figuring out what values need to be passed in for the other models.
```python
# Lint as: python3
"""
Read a file and output segmented results.
"""
import sys
import getopt

from lstm_word_segmentation.word_segmenter import pick_lstm_model


def main(argv):
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('test.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    print('Input file is', inputfile)
    print('Output file is', outputfile)

    with open(inputfile, 'r') as file1:
        lines = file1.readlines()

    word_segmenter = pick_lstm_model(model_name="Thai_codepoints_exclusive_model4_heavy",
                                     embedding="codepoints",
                                     train_data="exclusive BEST",
                                     eval_data="exclusive BEST")

    # Strip the newline character and segment each line
    for line in lines:
        line = line.strip()
        print(line)
        print(word_segmenter.segment_arbitrary_line(line))


if __name__ == "__main__":
    main(sys.argv[1:])
```
Could you specify what values should be used for `embedding`, `train_data`, and `eval_data` for the other models?
Burmese_codepoints_exclusive_model4_heavy
Burmese_codepoints_exclusive_model5_heavy
Burmese_codepoints_exclusive_model7_heavy
Burmese_genvec1235_model4_heavy
Burmese_graphclust_model4_heavy
Burmese_graphclust_model5_heavy
Burmese_graphclust_model7_heavy
Thai_codepoints_exclusive_model4_heavy
Thai_codepoints_exclusive_model5_heavy
Thai_codepoints_exclusive_model7_heavy
Thai_genvec123_model5_heavy
Thai_graphclust_model4_heavy
Thai_graphclust_model5_heavy
Thai_graphclust_model7_heavy
Or is there a simple way to have a function `get_lstm_model(model_name)` on top of `pick_lstm_model()` that fills in the necessary parameters and calls `pick_lstm_model()` for us?
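For concreteness, here is a rough sketch of the kind of wrapper I have in mind (the `get_lstm_model` name and the per-model argument table are hypothetical; only the `Thai_codepoints_exclusive_model4_heavy` entry is known to work, taken from the script above):

```python
from lstm_word_segmentation.word_segmenter import pick_lstm_model

# Hypothetical table of the extra arguments each released model needs.
# Only the Thai codepoints entry is confirmed; the rest would be filled in
# once the right values are known.
_MODEL_ARGS = {
    "Thai_codepoints_exclusive_model4_heavy": dict(embedding="codepoints",
                                                   train_data="exclusive BEST",
                                                   eval_data="exclusive BEST"),
    # ... entries for the other models listed above ...
}


def get_lstm_model(model_name):
    """Fill in the necessary parameters and call pick_lstm_model."""
    return pick_lstm_model(model_name=model_name, **_MODEL_ARGS[model_name])
```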
In this document, under "input_name", I explain the relationship between the names of the models and their hyperparameters. For `pick_lstm_model` it's actually much simpler: `embedding` should be the embedding that appears in the name of the model. For example, if the model name contains `codepoints`, you need `embedding="codepoints"`, and if it contains `graphclust`, you need `embedding="grapheme_clusters_tf"`. The choice of `train_data` and `eval_data` shouldn't matter if you are segmenting arbitrary lines (by calling the `segment_arbitrary_line` function), which is what I see in your code. However, if you want to train and evaluate using the BEST data or the my.txt file, you need to set `train_data` and `eval_data` to the appropriate values explained in the link above.
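To make the rule concrete, here is a minimal sketch of loading one model of each kind (the model names are from the Models directory; the `train_data`/`eval_data` strings for the graphclust model are placeholders, on the assumption that they do not matter when only `segment_arbitrary_line` is used):

```python
from lstm_word_segmentation.word_segmenter import pick_lstm_model

# "codepoints" appears in the model name, so embedding="codepoints".
thai_codepoints = pick_lstm_model(model_name="Thai_codepoints_exclusive_model4_heavy",
                                  embedding="codepoints",
                                  train_data="exclusive BEST",
                                  eval_data="exclusive BEST")

# "graphclust" appears in the model name, so embedding="grapheme_clusters_tf".
# "BEST" is used here only as a placeholder; see the linked document for the
# values needed if you actually want to train or evaluate.
thai_graphclust = pick_lstm_model(model_name="Thai_graphclust_model4_heavy",
                                  embedding="grapheme_clusters_tf",
                                  train_data="BEST",
                                  eval_data="BEST")

print(thai_graphclust.segment_arbitrary_line("ทดสอบการตัดคำภาษาไทย"))
```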
In fact, it would be possible to get rid of the `embedding` parameter of `pick_lstm_model` entirely if it were guaranteed that any model trained in the future follows the naming convention explained in that link, but I left it there because I wasn't sure that would be the case.
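As a rough illustration of that idea, such a wrapper could infer `embedding` by parsing the model name. This is only a sketch under the assumption that the naming convention holds; the genvec models are deliberately left out because their embedding string is not covered by the rule above, and the `train_data`/`eval_data` defaults are placeholders on the assumption that they are irrelevant for `segment_arbitrary_line`:

```python
from lstm_word_segmentation.word_segmenter import pick_lstm_model


def get_lstm_model(model_name, train_data="BEST", eval_data="BEST"):
    """Hypothetical helper: derive `embedding` from the model name and
    delegate to pick_lstm_model."""
    if "codepoints" in model_name:
        embedding = "codepoints"
    elif "graphclust" in model_name:
        embedding = "grapheme_clusters_tf"
    else:
        # e.g. the genvec models: their embedding string is not covered by
        # the rule above, so make the caller pass it explicitly.
        raise ValueError("Cannot infer embedding from model name: " + model_name)
    return pick_lstm_model(model_name=model_name,
                           embedding=embedding,
                           train_data=train_data,
                           eval_data=eval_data)
```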
Yes, apparently an old version of the dictionaries was on our shared Google Drive and I didn't notice it. Sorry if it wasted some of your time. I updated the *.ratio files on our drive. I checked the updated file "Thai_graphclust_ratio.npy" and it seems to give the same numbers that you mentioned above.
So the Python code that you ran was flawed (and I guess you got lower accuracy there), but whatever we had in the JSON files was up to date.
@sffc this should not affect our model performance in Rust, which is probably why we didn't spot it sooner.
I made a commit that does this and left a comment for you there. I forgot to submit a PR, but I basically just changed the files and those lines of code that read/write dictionaries. Please see my commit.
I also updated our Google Drive accordingly.