mvoelk/ssd_detectors

Encode-Decode problem

trungpham2606 opened this issue · 12 comments

It's such a great implementation. But when I tried to visualize the ground truth images after training SegLink for one epoch, I saw that some ground truth bounding boxes didn't fit the text, although I had checked them all before training.

Visualization method: starting from the normalized bounding box coordinates, I first encoded and then decoded them (using sl_utils). After that, I drew the images.

The problem doesn't appear in all images, only in some. So I wonder whether there is a bug in either the encode or the decode part. Can you spend some time taking a look at it?

If you need to know anything in detail, just let me know.

Which data set do you use? Can you provide one of these samples as well as a piece of code?

Okay, if you do something like

plt.imshow(images[i])
egt = prior_util.encode(data[i])                            # encode ground truth
prior_util.plot_gt()                                        # plot original ground truth
prior_util.plot_results(prior_util.decode(egt), color='r')  # plot encoded-then-decoded boxes in red

you may observe the following behavior

[image: index]

I confirm, this is a major issue with the segment width in the implementation and in the SegLink approach in general. When I wrote the code, there was no reference implementation available and I was not quite sure how to handle the segment width properly.

Let's look at Figure 5 (3) in the SegLink paper. There are exactly two cases that can occur on the left side of the shown segment. In the first case, another prior box is assigned to the word bounding box and the ground truth width of the corresponding segment is defined by means of the intersection between the prior and the word bounding box. This is also done in the implementation of the SegLink authors. The second case is when no further prior box can be assigned to the word bounding box and the decoded bounding box shrinks. Hence, the ground truth width of a segment is always less than or equal to the width of the prior box.
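The two cases can be sketched as follows. `segment_gt_width` is a hypothetical helper for illustration, not code from the repository; all coordinates are normalized x-values:

```python
def segment_gt_width(prior_xmin, prior_xmax, word_xmin, word_xmax):
    """Ground truth width of a segment, taken as the horizontal
    intersection of the prior box with the word bounding box.

    Because it is an intersection, the result can never exceed the
    width of the prior box itself.
    """
    left = max(prior_xmin, word_xmin)
    right = min(prior_xmax, word_xmax)
    return max(0.0, right - left)

# First case: the prior lies fully inside the word, so the segment
# keeps the full prior width.
w_inside = segment_gt_width(0.2, 0.4, 0.0, 1.0)

# Second case: the prior overhangs the word edge, so the segment
# width shrinks to the overlapping part.
w_clipped = segment_gt_width(0.2, 0.4, 0.3, 1.0)
```

This makes the shrinking visible: `w_clipped` is smaller than the prior width, which is exactly why the decoded box at the word boundary comes out too narrow.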

In my implementation, I found that only the second case is a problem when the cropped bounding box is passed to the recognition stage. For that reason, I decided to add some padding to the resulting bounding box.

A pragmatic fix could be to allow the leftmost and rightmost segments to have a width larger than the width of the prior box and then consider only the width of these segments in the loss function.
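As a sketch of that fix (hypothetical helper, not repository code), the width term of the loss could be masked so that only the two end segments of a word contribute:

```python
import numpy as np

def width_loss_mask(num_segments):
    """Boolean mask selecting only the left- and rightmost segment.

    Hypothetical helper: under the proposed fix, only these two
    segments would keep a width target (possibly wider than the prior
    box), and only their width error would enter the loss function.
    """
    mask = np.zeros(num_segments, dtype=bool)
    if num_segments > 0:
        mask[0] = True
        mask[-1] = True
    return mask

# For a word covered by five segments, only segments 0 and 4 contribute.
mask = width_loss_mask(5)
```

Multiplying the per-segment width error by such a mask before summing would leave the interior segment widths unconstrained.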

In the case of your dataset, I assume that the aspect ratio is not too large and the text is aligned almost horizontally. You would probably get better results with TextBoxes++ or even with TextBoxes.

@trungpham2606 I spent some time and took a closer look at the issue. It turned out that there is indeed an issue with the decoding as described in the SegLink paper. In Algorithm 1, step 6 only makes sense if x_p and x_q are the left and right edges of the bounding box, and step 8 only makes sense if x_p and x_q are the centers of the leftmost and rightmost segments.
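To make the distinction concrete, here is a hedged sketch (hypothetical functions, not the repository code) of the two width computations; the two readings only agree when step 6 uses edges and step 8 uses centers:

```python
def width_from_edges(x_p, x_q):
    # Step 6 of Algorithm 1: valid only if x_p and x_q are the left
    # and right edges of the combined bounding box.
    return x_q - x_p

def width_from_centers(x_p, x_q, w_left, w_right):
    # Step 8: valid only if x_p and x_q are the centers of the
    # leftmost and rightmost segments; half of each end segment's
    # width is then added back on either side.
    return (x_q - x_p) + (w_left + w_right) / 2.0

# Two segments of width 0.2, centered at 0.3 and 0.7, i.e. the
# combined box spans the edges 0.2 to 0.8. Both formulas give the
# same width only when fed the quantities they were written for.
w_edges = width_from_edges(0.2, 0.8)                 # edges
w_centers = width_from_centers(0.3, 0.7, 0.2, 0.2)   # centers
```

Mixing the two up (feeding centers into the edge formula, or vice versa) underestimates or overestimates the box width by half an end-segment on each side, which matches the misfitting boxes in the screenshot above.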

I have changed the decoding method to fix this issue and updated my previous comment to avoid confusion. The encoding works as described in the paper, but the issue I mentioned still remains.

The example from above now looks like this:
[image: index2]

The modified decoding slightly increased the f-measure of the SegLink model from 0.868 to 0.869.

Thank you!

@mvoelk thank you so much for your support. I will apply your change to my dataset and show you my results then.

@mvoelk I just tested your new decode script. The results look better than before; there are still some images where the ground truth bounding boxes don't fit the text, though. But the results are way better, at least in my case.
Thank you so much for your help. If you figure out anything else to improve or completely fix the decode part, just let me know.

F-measure of SegLink with DenseNet and Focal Loss increased from 0.922 to 0.932.

Oh nice, can you provide the parameters you chose for training with Focal Loss? I set them to your defaults, but the loss was much worse than with the normal loss.
Thanks in advance!

@trungpham2606 I'm not sure if the default values in sl_training.py are correct. Can you try lambda_segments=1.0, lambda_offsets=1.0, lambda_links=1.0 and report whether the scale is roughly the same as in the log file I provided with the model? Which f-measure do you get on segments?

@mvoelk Actually I tried it on my dataset (the images I showed you above). I observed that the focal loss started at 10000 or even higher and then decreased, but only slowly.
I will try your suggestion and show you the results as soon as possible.

I usually divide the loss terms by the number of instances. In SegLinkFocalLoss I commented this normalization out. You should get the old behavior if you uncomment the relevant lines.
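A minimal sketch of the effect (hypothetical values, not the actual loss code): summing a per-instance loss term grows with the number of instances, while dividing by the instance count keeps the scale roughly constant, which explains the very different loss magnitudes observed above:

```python
import numpy as np

# Hypothetical per-instance loss terms for one batch.
per_instance = np.array([2.0, 4.0, 6.0])

# Un-normalized: the scale grows with the number of instances,
# which can produce very large initial loss values.
summed = per_instance.sum()

# Normalized: dividing by the instance count keeps the scale
# comparable across batches with different numbers of instances.
normalized = per_instance.sum() / len(per_instance)
```

So whether the normalization lines are commented out or not mainly changes the scale of the reported loss, not what is being optimized per instance.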