Training takes too long!!
andy-soft opened this issue · 20 comments
Hello, I was wondering what are the training times for the demonstrations.
I just tried the English sequence labeler, and it took 1 hour to process 10% of the corpus! (Is this normal?)
Deep learning is known to be CPU-hungry, and I have only 2 cores and 8 GB of RAM (sorry).
Do I need to change my PC, or get a CUDA-capable GPU to help with the computation?
Is there a way to stop training manually, or programmatically after reaching a certain error rate?
I am wondering if you have ever tried sequence labeling on highly inflectional languages like Spanish, which has so much inflectional complexity that whole-word strings are nearly useless: the vocabulary explodes to over 300M word forms, and the "examples" found in text become too sparse. Even with negative sampling you never see certain combinations, because most verbs have over 200 inflected forms (tense, person, gender, number, mood, etc.). So there is a need to train on higher-level features without losing the "semantic" sense. Do you think it would be possible to decompose the words (by means of controlled, independent lemmatization) into parts/chunks (prefix, root, suffix, as well as modal information and semantic features of the parts)? My intuition is that this might lower the training cost and improve generalization with a smaller corpus, by capturing higher-level syntax rules and, along the way, generating semantic content constraints (maybe even some common sense)...
It's just a theoretical question!
Hi @andy-soft,
For your labeling task, how many categories do you want to label? Could you please share the configuration file you are using with me? Then I can estimate whether the current performance is reasonable. Currently, RNNSharp doesn't support GPU training. It supports CPU training with SIMD instructions only, so you need a powerful CPU with a modern SIMD instruction set, such as AVX, AVX2, and so on.
I did use RNNSharp for sequence labeling tasks on inflectional languages such as English, for example POS tagging, named entity recognition, and so on. Usually, the number of label categories is no more than 50. If there are too many categories, it will definitely hurt performance, and you should optimize them, for example by splitting them into a few basic units for labeling. If it's really hard to reduce their number, you could use SampledSoftmax as the output layer type. For each token, it randomly samples some categories, plus the categories appearing in the current sentence, for training, instead of using the entire category set.
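The sampling idea behind SampledSoftmax can be sketched roughly as follows. This is an illustrative Python sketch of the sampling step only, not RNNSharp's actual implementation; the function name and parameters are my own:

```python
import random

def sampled_categories(all_labels, sentence_labels, sample_size=20, seed=0):
    """Pick a training subset of output categories: the labels that actually
    occur in the current sentence, plus a random sample of the rest
    (rough sketch of the SampledSoftmax idea, not RNNSharp code)."""
    rng = random.Random(seed)
    must_keep = set(sentence_labels)
    rest = [lbl for lbl in all_labels if lbl not in must_keep]
    sampled = rng.sample(rest, min(sample_size, len(rest)))
    return sorted(must_keep) + sampled

# e.g. 900 composite POS tags, but each sentence only uses a handful
all_labels = [f"TAG_{i}" for i in range(900)]
subset = sampled_categories(all_labels, ["TAG_3", "TAG_42"], sample_size=20)
print(len(subset))  # 22 instead of 900
```

The softmax is then computed over this small subset during training, which is why it scales so much better when the full category set is huge.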
It would be really appreciated if you could contribute to RNNSharp. :)
For word2vec, you can try my version: https://github.com/zhongkaifu/Txt2Vec It has higher performance than the original word2vec and supports incremental training.
For "the problem is the many labels of each word, the variability is huge, more than 900 different POS labels (EAGLES 2 version)", could you please give me a specific example? Sorry, I don't understand it.
Hi Andrés,
Thanks for your explanation in details. It's really helpful.
For your task, to improve performance and reduce the number of output categories, you could try sub-word level or character level segmentation and labeling. Take the example you mentioned above, "hiperrecontrabuenísimo": if you have a sub-word dictionary for training, you could build a training corpus like:
hiper \t S_Aug1
recontra \t S_Aug2
buen \t S_CorePart
ísimo \t S_Aug3
So, the label "Aug1Aug2CorePartAug3" is split into four basic tags. Or you could try character level labeling, such as:
h \t B_Aug1
i \t M_Aug1
p \t M_Aug1
e \t M_Aug1
r \t E_Aug1
This way, the number of output categories is significantly reduced.
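The expansion from sub-word segments to per-character tags described above can be sketched like this. This is an illustrative Python sketch, not RNNSharp code; the helper name is my own, and I assume a B/M/E position scheme matching the "hiper" example:

```python
def char_tags(segments):
    """Expand (substring, tag) segments into per-character B/M/E labels,
    as in the 'hiper' example above (illustrative sketch, not RNNSharp code)."""
    rows = []
    for text, tag in segments:
        for i, ch in enumerate(text):
            if i == 0:
                pos = "B"                   # begin of segment (also single-char)
            elif i == len(text) - 1:
                pos = "E"                   # end of segment
            else:
                pos = "M"                   # middle of segment
            rows.append((ch, f"{pos}_{tag}"))
    return rows

# "hiperrecontrabuenísimo" segmented with its sub-word tags
word = [("hiper", "Aug1"), ("recontra", "Aug2"),
        ("buen", "CorePart"), ("ísimo", "Aug3")]
for ch, tag in char_tags(word):
    print(f"{ch}\t{tag}")
```

With this scheme, the tag inventory is bounded by (number of positions) × (number of basic sub-word tags), no matter how many composite labels the full words would otherwise generate.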
Thanks
Zhongkai Fu
In addition, did you try the latest RNNSharp code (checked out from the master branch)? It's much faster than the released version, since I have not updated the release package yet.
According to the RNN output lines, you are still using an older RNNSharp. Please sync the latest source code (not the released demo package, since I have not updated it yet), build it, and train your model.
It's okay to send me your training example, configuration file, and the command line you ran.
First of all, your CPU has only two cores; this is the main reason why training is slow.
Secondly, I don't know if your CPU supports the AVX and AVX2 instructions, which are used for SIMD to speed up training. You could share the first few log lines with me, and I will take a look.
Finally, you could set TFEATURE_CONTEXT=0 to reduce the number of sparse features to speed up training.
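On Linux, one quick way to check for those SIMD extensions is to look at the flags in /proc/cpuinfo. The sketch below is my own illustration (the function name is hypothetical), and it assumes the Linux /proc/cpuinfo format:

```python
def cpu_simd_flags(cpuinfo_text):
    """Return which SIMD extensions appear in a Linux /proc/cpuinfo dump
    (a quick way to check whether the CPU offers AVX/AVX2 for RNNSharp)."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return {ext: ext in flags for ext in ("sse2", "avx", "avx2")}

# On a real Linux machine:
# with open("/proc/cpuinfo") as f:
#     print(cpu_simd_flags(f.read()))

# Self-contained demo with a fabricated flags line:
sample = "flags\t\t: fpu sse sse2 avx\n"
print(cpu_simd_flags(sample))  # {'sse2': True, 'avx': True, 'avx2': False}
```

On Windows, tools such as CPU-Z or the CPU vendor's specification page give the same information.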
Hi Andrés
It would be really appreciated if you would like to contribute to the RNNSharp project. :)
I cannot see your inline image for the CPU G2020. According to the information at http://www.cpu-world.com/CPUs/Pentium_Dual-Core/Intel-Pentium%20G2020.html, it seems this CPU doesn't support the AVX and AVX2 instructions, so RNNSharp cannot emit SIMD instructions to speed up training.
I'm using System.Numerics.Vectors, which is a component of .NET Core, to emit SIMD instructions (AVX and AVX2) for RNNSharp.
If that AMD CPU supports these AVX instructions, RNNSharp can leverage them as well.
Hi there, I just got a CPU with 16 cores and 128 GB of RAM. Ready to train hard!!
Cool!
I recently introduced MKL into Seq2SeqSharp and got a significant performance improvement; if you like, you could try it in RNNSharp.
I just started training the English SeqClassif (NER) sample from your demos: a 143 MB flat text file, 2.2M words.
I got a 32-core Xeon 3500 server with 128 GB of RAM, and...
it took >24 hours to reach a mere 0.89% token error and 8.89% sequence error (about 40% of the total training time; then I aborted it). I am scared by how unusually long it takes to train these sets...
The binary model file is 1.8 GB!
Are those normal training times and model sizes?
Or should I go and purchase a multi-core CUDA GPU and use another LSTM library?