jimitshah77/Focal-CTC-OMR

I have a question about the gamma value in the paper. If gamma is less than one, then for a large p (i.e., an easy example), (1 - p)^gamma gives a bigger weight to the CTC loss, which seems different from the idea of focal loss. As written in the paper, an easy example will get a bigger weight and a hard example a smaller weight.

cjt222 opened this issue · 1 comment



No, the idea proposed in the paper is correct. It is the relative loss that matters: a hard example (p < 0.5) should contribute more to the loss, and an easy example (p > 0.5) should contribute less.
For example, with gamma = 0.5:
p = 0.3 for the hard example, p = 0.7 for the easy example
hard_loss = (1 - 0.3)^0.5 = 0.7^0.5 ≈ 0.84
easy_loss = (1 - 0.7)^0.5 = 0.3^0.5 ≈ 0.55
hard_loss > easy_loss

Another example, with gamma = 2.0:
p = 0.3 for the hard example, p = 0.7 for the easy example
hard_loss = (1 - 0.3)^2 = 0.7^2 = 0.49
easy_loss = (1 - 0.7)^2 = 0.3^2 = 0.09
hard_loss > easy_loss
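
To make the arithmetic easy to replay, here is a small Python snippet (illustrative only) that reproduces both examples:

```python
# Focal weight (1 - p) ** gamma for the two worked examples above.
# For any gamma > 0, the hard example (low p) always outweighs the easy one.
p_hard, p_easy = 0.3, 0.7

for gamma in (0.5, 2.0):
    hard_w = (1 - p_hard) ** gamma  # 0.7 ** gamma
    easy_w = (1 - p_easy) ** gamma  # 0.3 ** gamma
    print(f"gamma={gamma}: hard={hard_w:.2f} > easy={easy_w:.2f}")
    assert hard_w > easy_w
```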

Here gamma controls the relative difference between the two losses. If the dataset is highly skewed, choose a higher value of gamma.
(Check the figure below: it shows how different gamma values impact the loss during training.)
[Figure from the Focal Loss paper: loss vs. probability of the ground-truth class for several values of gamma]
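
As a rough numeric stand-in for that figure (a sketch, not the paper's code), the full focal loss FL(p) = -(1 - p)^gamma * log(p) can be tabulated for a few gamma values; note how sharply the loss of well-classified examples (high p) shrinks as gamma grows:

```python
import math

# Focal loss on the ground-truth class: FL(p) = -(1 - p) ** gamma * log(p).
# gamma = 0 recovers plain cross entropy; larger gamma suppresses easy examples harder.
for gamma in (0.0, 0.5, 1.0, 2.0, 5.0):
    row = {p: -((1 - p) ** gamma) * math.log(p) for p in (0.1, 0.5, 0.9)}
    print(gamma, {p: round(loss, 3) for p, loss in row.items()})
# At p = 0.9 the loss falls from ~0.105 (gamma = 0) to ~0.001 (gamma = 2).
```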

In sequence tasks like OCR/OMR, the probability mass p is distributed over many symbols while the probabilities still sum to 1 (sum(p) = 1), so individual sequence probabilities tend to be small. Therefore we select a lower gamma value (but still greater than 0), as long as the relative loss between easy and hard samples is maintained.
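
For concreteness, here is a minimal sketch of how such a focal weight can be attached to CTC loss in TensorFlow/Keras. This is an assumption of how it might look, not necessarily this repository's exact implementation; the function name, the gamma default, and the exp(-ctc) recovery of p are mine:

```python
import tensorflow as tf

def focal_ctc_loss(y_true, y_pred, input_length, label_length, gamma=0.5):
    """Hypothetical focal-CTC sketch: down-weight easy (high-probability) sequences."""
    # CTC returns -log p(label | input) per sample, so p = exp(-ctc).
    ctc = tf.keras.backend.ctc_batch_cost(y_true, y_pred,
                                          input_length, label_length)
    p = tf.exp(-ctc)                 # sequence probability in (0, 1]
    weight = tf.pow(1.0 - p, gamma)  # focal weight: small when p is high (easy)
    return weight * ctc
```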

In short, gamma (> 0) controls the relative loss we want to impose on easy versus hard samples.

If a > b > 0, then a^n > b^n holds true as long as n > 0.
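
A quick property check of that inequality (illustrative only):

```python
import random

# For 0 < b < a and any n > 0, a ** n > b ** n: x ** n is strictly
# increasing on the positive reals, so the focal weight ordering is preserved.
for _ in range(1000):
    a, b = sorted((random.uniform(0.01, 1.0), random.uniform(0.01, 1.0)),
                  reverse=True)
    n = random.uniform(0.01, 10.0)
    if a > b:
        assert a ** n > b ** n
```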

Image source: Focal Loss for Dense Object Detection (https://arxiv.org/pdf/1708.02002.pdf)