IAM dataset currently on 267 epoch with ~6.2% cer ...
miliadis opened this issue · 19 comments
Hi @jpuigcerver ,
I trained on the IAM dataset following the README instructions at https://github.com/jpuigcerver/Laia/tree/master/egs/iam . I am currently at epoch ~267 with ~6.2% CER on the validation set. Since the README says that 3.8% CER will be reached at epoch 80, I'm just wondering if there is any change that I am not aware of.
IAM dataset: 6176 training samples, 976 val samples
Hi @miliadis
Thanks for pointing out the problem.
I just got a fresh clone of the current Laia version and I'm trying to reproduce the experiment to see at which point it diverges from my previous run. I'll come back to you once I've found where the problem is.
By the way, the README does not say that you should get 3.8% CER after 80 epochs. You'll achieve approximately that CER when the validation error stops improving for 80 epochs. In my case, the complete training took 514 epochs.
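For readers unfamiliar with this kind of stopping criterion, the scheme described above (stop when the validation error has not improved for a fixed number of epochs) can be sketched as follows. This is an illustrative sketch, not Laia's actual implementation; the class name and `min_delta` parameter are my own.

```python
class EarlyStopping:
    """Stop training once `patience` epochs pass without improvement."""

    def __init__(self, patience=80, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = 0

    def step(self, epoch, valid_cer):
        # An epoch counts as an improvement only if the validation CER
        # drops by more than `min_delta` below the best value so far.
        if valid_cer < self.best - self.min_delta:
            self.best = valid_cer
            self.best_epoch = epoch
            return False  # keep training
        # Trigger the stop after `patience` epochs without improvement.
        return epoch - self.best_epoch >= self.patience
```

With `patience=80`, training can easily run for hundreds of epochs even though the best checkpoint was found much earlier, which is consistent with the 514-epoch run mentioned above.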
Anyhow, I did find a bug in the train_lstm1d.sh script: the dropout in the convolutional layers is not activated.
In any case, the CER that you are getting is very high. In my run I got <=6.2% CER on the validation set after 36 epochs. I'll keep looking into it.
I also uploaded my models for IAM and Rimes, and the logs for IAM.
Hey @jpuigcerver
Got it, thanks for the clarification about the 80 epochs. So, this is what I get after ~36 epochs:
I am training now with dropout=0.2 and we will see... but it's strange that you were getting ~6.2% without dropout at 36 epochs...
FYI, I just committed a change to use the correct dropout values in the CNNs: See commit c6bb8ab.
I also see the same issue and got a CER of 5.98% after 469 epochs (before your latest dropout commit). Not sure if it is related, but the default preprocessing sometimes seems overly aggressive (on my system the preprocessed C04-110-00.png is completely missing the "a" character).
I suspect that the differences are due to a change made in the imgtxtenh tool, on March 13th, 2017 (mauvilsa/imgtxtenh@5cca789). @mauvilsa changed the default units from "mm" to "pixels". I think I have located where the problem is, but I need to check it first.
Since I still have the data processed with the previous version, I am re-training the model from scratch with the current version of Laia and the data processed with the old version of imgtxtenh tool.
I'll come back to you as soon as I get some update.
@miliadis @bdotgradb Could you please modify this line from the prepare_images.sh script:
https://github.com/jpuigcerver/Laia/blob/master/egs/iam/steps/prepare_images.sh#L44
And simply remove the "-u mm" part. You will need to process the images again and start the training from scratch. But I suspect that it will solve the problem, I am trying it myself.
Thanks, I will try this and let you know...
Results of my re-run:
Finished training after 430 epochs. According to "valid_cer" criterion, epoch 350 was the best: duration = 21s ; batches = 61 ; min./max./avg. chunks/batch = 1/1/1.0 ; loss = 0.031728 ; cer = 3.80% ; del = 0.65% ; ins = 0.52% ; sub = 2.64% ; cer_ci = [ 3.52%, 4.08%] ; ci_alpha = 5.000%
and results of decoding:
%CER lines va: 3.90
%WER lines va: 13.52
%CER forms va: 3.78
%WER forms va: 13.50
%CER lines te: 5.77
%WER lines te: 18.17
%CER forms te: 5.62
%WER forms te: 18.13
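As background for the numbers above: CER is the character-level edit distance between hypothesis and reference (the deletions, insertions, and substitutions reported in the training log), divided by the reference length, so the del/ins/sub percentages sum to the total CER. A minimal sketch, not Laia's actual scorer:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    n, m = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            curr[j] = min(sub, prev[j] + 1, curr[j - 1] + 1)  # del / ins
        prev = curr
    return prev[m] / n
```

WER is computed the same way over word tokens instead of characters, which is why it is several times larger than the CER for the same output.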
Why is the validation CER in the training report different from the final decode (3.80% vs 3.90%)?
Hi,
The results are different because the output of the decoding depends slightly on the batch composition.
During training, we reshuffle the validation set on each epoch, while during decoding the original order of the examples, according to your input file, is used.
Because the input images have different sizes, but all are zero-padded to the size of the largest one in the batch, the decoding results may differ slightly depending on which images end up batched together.
We always report the results on the separate evaluation step, which processes the files in the original (alphabetic) order.
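To make the padding effect concrete, here is a minimal NumPy sketch (illustrative only; not Laia's batching code). The same image is padded to a different width depending on which images share its batch, so the network sees slightly different inputs at the right edge:

```python
import numpy as np

def pad_batch(images):
    """Zero-pad variable-width images (H x W arrays) to the batch's max width."""
    h = images[0].shape[0]
    max_w = max(img.shape[1] for img in images)
    batch = np.zeros((len(images), h, max_w), dtype=images[0].dtype)
    for i, img in enumerate(images):
        batch[i, :, :img.shape[1]] = img
    return batch

a = np.ones((32, 100))
b = np.ones((32, 250))
c = np.ones((32, 120))
# Image `a` is padded to width 250 in one batch but only 120 in another.
print(pad_batch([a, b]).shape)  # (2, 32, 250)
print(pad_batch([a, c]).shape)  # (2, 32, 120)
```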
@miliadis Did you manage to reproduce the results as @bdotgradb did? If so, I'll close this issue. BTW, I updated a few scripts recently to fix some other bugs that I found.
@jpuigcerver my training is not done yet, but I definitely see an improvement (epoch 250 -> CER 4.30%).
Final results: ../../laia-train-ctc:387: Epoch 292, last epoch with a significant improvement on "valid_cer" criterion was 212. Triggering early stop!
[2017-12-05 23:32:08 INFO] ../../laia-train-ctc:401: Finished training after 292 epochs. According to "valid_cer" criterion, epoch 212 was the best: duration = 16s ; batches = 61 ; min./max./avg. chunks/batch = 1/1/1.0 ; loss = 0.029689 ; cer = 4.13% ; del = 0.72% ; ins = 0.60% ; sub = 2.80% ; cer_ci = [ 3.82%, 4.42%] ; ci_alpha = 5.000%
My training finished at epoch 292 and the final CER is 4.13%. This is not exactly 3.80%, but close... so, is this difference acceptable?
@miliadis I would expect a CER closer to what @bdotgradb and I obtained. Could you please upload somewhere your training log?
@jpuigcerver I am training again after your most recent changes.
I have committed some changes to the imgtxtenh tool (mauvilsa/imgtxtenh#3). If the same parameters as originally in the script are used (imgtxtenh -u mm -d 118.110), then exactly the same processing as before should be done.
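For context on that density value (my reading, not stated in the thread): 118.110 appears to be 300 DPI expressed in pixels per centimetre, which is the resolution the IAM forms were scanned at. The conversion is just dividing by 2.54 cm per inch:

```python
# Presumed origin of imgtxtenh's "-d 118.110" density value:
# 300 dots per inch converted to pixels per centimetre (1 in = 2.54 cm).
dpi = 300
pixels_per_cm = dpi / 2.54
print(f"{pixels_per_cm:.3f}")  # 118.110
```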
Thanks @jpuigcerver and @mauvilsa, I was able to reproduce the IAM results.
Closing the issue now.