tmbdev/clstm

load pretrained model assert failed

striversist opened this issue · 21 comments

Loading a pretrained model to continue training on new samples causes an assertion failure in Codec::encode. When training from scratch, the problem does not appear to occur.
See related issue #83.

After digging into the code a little, I found this clue in main1 of clstmocrtrain.cc:

  if (load_name != "") {
    clstm.load(load_name);
  } else {
    Codec codec;
    trainingset.getCodec(codec);
    print("got", codec.size(), "classes");

When training from scratch, load_name is empty, so execution reaches trainingset.getCodec(codec). That function runs the chain codec.build(gtnames, charsep) -> Codec::set, so every character in the training samples is inserted into the encoder map.

When loading a pretrained model to train on new samples, load_name is not empty, so clstm.load(load_name) loads the pretrained codec into the encoder map. Later, in Codec::encode, if a new sample string contains a character that is not in the pretrained encoder map, assert(encoder->count(c) > 0) fails.
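The failure mode described above can be reproduced with a simplified stand-in for the codec. This is only a sketch: ToyCodec and its missing() method are hypothetical names, not part of clstm, and the real Codec class differs in detail. It shows how a pre-flight check could report unknown characters before the assertion is hit.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for a codec (hypothetical, not the clstm class):
// maps code points to class indices and back.
struct ToyCodec {
  std::vector<int> codec;      // class index -> code point
  std::map<int, int> encoder;  // code point -> class index

  void set(const std::vector<int> &chars) {
    codec = chars;
    encoder.clear();
    for (int i = 0; i < (int)codec.size(); i++) encoder[codec[i]] = i;
  }

  // Returns the code points in `s` that the codec cannot encode.
  std::vector<int> missing(const std::u32string &s) const {
    std::vector<int> out;
    for (char32_t c : s)
      if (encoder.count((int)c) == 0) out.push_back((int)c);
    return out;
  }

  // Mirrors the failing path: every char must already be in the encoder map.
  std::vector<int> encode(const std::u32string &s) const {
    std::vector<int> classes;
    for (char32_t c : s) {
      assert(encoder.count((int)c) > 0);
      classes.push_back(encoder.at((int)c));
    }
    return classes;
  }
};
```

Calling missing() on each ground-truth string before training would list the offending characters instead of aborting on the first one.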

Hope contributors fix this problem ASAP.

The following change temporarily fixes the assertion failure. @wanghaisheng

  if (load_name != "") {
    clstm.load(load_name);
    trainingset.getCodec(clstm.net->codec);    // Add this line
  } else {
    Codec codec;
    trainingset.getCodec(codec);
    print("got", codec.size(), "classes");

I don't know whether there are other side effects.

I looked into the source code you referred to. It seems that after loading an existing model we should first get the codec vector for that model and the codec for the training set, then combine the two vectors into one and apply it with Codec::set (https://github.com/tmbdev/clstm/blob/master/clstm.cc).
Initially I wanted to train on Chinese starting from the Japanese model. I have not tried it yet, but my training data and the data used for the existing Japanese model are definitely not the same.
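A minimal sketch of the "combine the two vectors" idea, assuming codecs are plain vectors of code points indexed by class. merge_codecs is a hypothetical helper, not a clstm function. It keeps every existing class index in place and appends only the new code points, since re-sorting the combined list would shift the indices the trained network already uses. Note that this only addresses the lookup table; the size of the network's output layer would still have to match the merged codec.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: extend an existing codec with new code points.
// Existing entries keep their class index; new ones are appended in sorted order.
std::vector<int> merge_codecs(const std::vector<int> &model_codec,
                              const std::vector<int> &data_codec) {
  std::set<int> known(model_codec.begin(), model_codec.end());
  std::set<int> extra;
  for (int c : data_codec)
    if (known.count(c) == 0) extra.insert(c);
  std::vector<int> merged(model_codec);
  merged.insert(merged.end(), extra.begin(), extra.end());
  // Old classes must not move, otherwise trained outputs would point at the wrong chars.
  for (std::size_t i = 0; i < model_codec.size(); i++)
    assert(merged[i] == model_codec[i]);
  return merged;
}
```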

Thanks for reporting.

I don't think reading the whole dataset on every load is a good solution.

The best solution IMHO would be to encode all ~128,000 Unicode code points at the first load.

Update: It's not a good idea, see comments below.

@amitdo In my experience, the more Unicode characters you load into the codec, the slower the training process will be.
So I don't think it's the best solution, no offence.

I'm not suggesting actually training on all those chars...

OK, looking forward to your solution.

In the meantime, don't use your temporary solution. I believe it will mess up your model.

Hope contributors fix this problem ASAP.

I don't think this will be easy. The codec determines the size of the network's layers, i.e. there will be weights/connections in the network for each of the characters in the codec. To add new characters not in the original training data during re-training, you would have to modify the structure of the network before training, which is pretty complicated: you'd have to add extra dimensions to a lot of the weight/bias matrices. Is this what you're suggesting, @amitdo?
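To illustrate why the codec size is tied to the network structure, here is a toy output layer with one weight row per class. The layout (one row per class, one column per hidden unit plus a bias column) and the names ToyOutputLayer and grow are assumptions for illustration only; clstm's actual layers are organized differently and there are more matrices involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy output layer: one row of weights per class, one column per hidden unit
// plus a bias column. This layout is an assumption for illustration only.
struct ToyOutputLayer {
  std::size_t nhidden;
  std::vector<std::vector<float>> W;  // W.size() == number of classes

  ToyOutputLayer(std::size_t nclasses, std::size_t nhidden_)
      : nhidden(nhidden_), W(nclasses, std::vector<float>(nhidden_ + 1, 0.0f)) {}

  // Grow the layer to `nclasses` rows; the new rows are untrained (all zeros).
  void grow(std::size_t nclasses) {
    assert(nclasses >= W.size());
    W.resize(nclasses, std::vector<float>(nhidden + 1, 0.0f));
  }

  std::size_t nparams() const { return W.size() * (nhidden + 1); }
};
```

Growing the layer keeps the trained rows intact, but the added rows start untrained, so the new characters still have to be learned. Under this assumed layout, a codec of 3700 classes with nhidden = 800 already gives 3700 × 801 = 2,963,700 output weights.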

Is there a problem with registering chars in the model's codec at build time (first time only), even if some of them won't be trained? For example, for Chinese, registering 6,000-10,000 symbols.

Is this what you're suggesting, @amitdo?

I missed that sentence.

My answer: Certainly not!

My suggested solution:
The user will have an option to point to a file containing all the chars they think they will ever need for a specific model. Some of the chars might not appear in the dataset given as input to the network. Later on, the user can find another dataset and train on it with the existing model. The codec won't be updated.

What do you think about that?
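A rough sketch of what reading such a character file could look like, assuming the file is UTF-8 text. decode_utf8 and codec_from_stream are hypothetical helpers, not clstm functions, and reserving index 0 for code point 0 is an assumption made here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Minimal UTF-8 decoder: invalid lead bytes and truncated sequences are
// skipped; continuation bytes are not validated.
std::vector<int> decode_utf8(const std::string &s) {
  std::vector<int> out;
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char b = (unsigned char)s[i];
    std::size_t len = b < 0x80 ? 1 : (b >> 5) == 0x6 ? 2
                    : (b >> 4) == 0xE ? 3 : (b >> 3) == 0x1E ? 4 : 0;
    if (len == 0 || i + len > s.size()) { i++; continue; }
    int cp = len == 1 ? b : b & (0xFF >> (len + 1));
    for (std::size_t k = 1; k < len; k++)
      cp = (cp << 6) | ((unsigned char)s[i + k] & 0x3F);
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Hypothetical helper: build a codec from a user-supplied character list.
// Index 0 holds code point 0 (an assumption); the rest are sorted and unique.
std::vector<int> codec_from_stream(std::istream &in) {
  std::ostringstream buf;
  buf << in.rdbuf();
  std::set<int> chars;
  for (int cp : decode_utf8(buf.str()))
    if (cp != 0 && cp != '\n' && cp != '\r') chars.insert(cp);
  std::vector<int> codec(1, 0);
  codec.insert(codec.end(), chars.begin(), chars.end());
  assert(codec[0] == 0);
  return codec;
}
```

Because the result depends only on the file, two training runs with different datasets but the same character file would get identical class indices.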

kba commented

I also do not think that there is a sensible approach to extending a trained model for symbols the network was not originally aware of. It is possible to adapt the data structures (e.g. just adding new code points to the codec) but it will result in an inconsistent model unless you fully retrain - which is what you do not want, obviously.

The user will have an option to point to a file which will contain all the chars he think he will ever need for a specific model. Some of the chars might not appear in the dataset given as input for the network. #106 (comment)

This seems a straightforward approach depending on how much providing all possible chars in the codec degrades training performance.

With experience, the more unicode codec you load, the slower the training process will be. #106 (comment)

Is it such a performance hit to have a large codec size even if the training data contains only a subset of those characters?

Implementing some form of "pre-loading" of e.g. full Unicode code pages instead of building the codec from the training set (as @amitdo suggests) is doable, but I'm at a loss on the consequences wrt performance and network consistency. If the number and frequency of new chars is small (e.g. a few new variants of letters), it will take a long time to accurately predict them, but it seems plausible. If it's a completely independent training set (like extending a Japanese model with Chinese training data), wouldn't that effectively require un-learning the old model and creating a new one?

Also, enabling such pre-loading would require retraining from scratch with the extended codec which can be very time-consuming, depending on the actual number of chars in the training set:

I have been running a training on Chinese characters for 5 months; after 700000 iterations
the error rate is still above 3.0 #81 (comment) @wanghaisheng

I am trying a 2492-char subset. It seems to take several weeks (hidden=200, this time).
(NO, nhidden = 200 seems to be hopeless; the network seems to learn one char by forgetting another.)
Now trying 3700 chars (a little bigger tesseract jp-dataset) with nhidden = 800 and nhidden = 1200.
Unless my PC breaks, I will see the result next spring. #49 @isaomatsunami

The issue is mostly with Chinese and Japanese.

Training both Chinese and Japanese in the same model is not a good idea.

Chinese has so many characters that we often train only on the commonly used ones.
But sometimes that's not enough: we want to add some uncommon characters, and that is when the problem happens.
Retraining from scratch will undoubtedly take a very long time. I think it's the same story with Japanese training.

I think your external codec file solution is good. We can prepare some codecs for future use. @amitdo

We often come across multilingual documents such as English-Chinese or Japanese-Chinese. The characters of both languages are valuable for our use case.

Resizing the output layer of the network after training is generally not possible, although it would be possible to pre-create unused nodes and make up codec entries for these afterwards. On the other hand, this is a less than smart idea, as the performance impact is rather high even for rather small scripts and their combinations, e.g. Greek and Latin (codec size <300).

IMHO just retrain your models and invest some time to streamline the process. It's something you should be doing anyway and is quite a bit more straightforward than trying to repurpose already existing models.

Finally, with Unihan it is actually quite a neat idea to train combined CJK models, as it shouldn't increase the output layer size for the vast majority of glyphs in either of the Hanzi scripts. On the other hand, finding a network configuration that works for this multi-font model may take some hyperparameter exploration.

@striversist, I decided not to implement what I suggested before. It seems not to be such a good idea after all. Sorry.