I have implemented a TensorFlow binding, but the gradient may be wrong.
Closed this issue · 16 comments
I have modified the warp-ctc TensorFlow binding and the CAT PyTorch binding. I have also removed costs_beta (maybe useless).
def ctc_crf_loss(logits, labels, input_lengths,
                 blank_label=0, lamb=0.1):
    '''Computes the CTC-CRF loss between a sequence of logits and a
    ground truth labeling.
    Args:
        logits: A 3-D Tensor of floats. The dimensions
            should be (t, n, a), where t is the time index, n
            is the minibatch index, and a indexes over
            logits for each symbol in the alphabet.
        labels: An int32 SparseTensor. labels.indices[i, :] == [b, t] means
            labels.values[i] stores the id for (batch b, time t).
            labels.values[i] must take on values in [0, num_labels).
        input_lengths: A 1-D Tensor of ints, the number of time steps
            for each sequence in the minibatch.
        blank_label: int, the label value/index that the CTC
            calculation should use as the blank label.
        lamb: float, A weight α for CTC Loss.
            Combined with the CRF loss to help convergence.
    Returns:
        1-D float Tensor, the cost of each example in the minibatch
        (as negative log probabilities).
    * This class performs the softmax operation internally.
    * The label reserved for the blank symbol should be label 0.
    '''
    # The input of the warp-ctc is modified to be the log-softmax output
    # of the bottom neural network.
    activations = tf.nn.log_softmax(logits)  # (t, n, a)
    activations_ = tf.transpose(activations, (1, 0, 2))  # (n, t, a)
    loss, _, _, costs_alpha = _ctc_crf.ctc_crf_loss(
        activations, activations_, labels.indices, labels.values,
        input_lengths, blank_label, lamb)  # costs, gradients, grad_net, costs_alpha
    return (costs_alpha - (1 + lamb) * loss)  # (n,)


@ops.RegisterGradient("CtcCrfLoss")
def _CTCLossGrad(op, grad_loss, a, b, c):
    """The derivative provided by CTC-CRF Loss.
    Args:
        op: the CtcCrfLoss op.
        grad_loss: The backprop for cost.
    Returns:
        The CTC-CRF Loss gradient.
    """
    lamb = op.get_attr('lamb')
    grad_ctc = op.outputs[1]  # (t, n, a)
    grad_den = tf.transpose(op.outputs[2], (1, 0, 2))  # (t, n, a)
    grad = grad_den - (1 + lamb) * grad_ctc  # (t, n, a)
    # average with batch size.
    grad /= tf.cast(_get_dim(grad, 1), dtype=tf.float32)  # (t, n, a)
    # Return gradient for inputs and None for
    # activations_, labels_indices, labels_values and sequence_length.
    return [_BroadcastMul(grad_loss, grad), None, None, None, None]
    # return [_BroadcastMul(grad_loss, op.outputs[1]), None, None, None, None]
I can provide all the code if necessary, but my results are wrong: the TER is over 100%.
In my opinion:
CRF_Loss = -(log_prob_ctc - log_prob_den) + lamb * (-log_prob_ctc)
         = log_prob_den - (1 + lamb) * log_prob_ctc
so the outputs of ctc_crf_base.gpu_ctc and ctc_crf_base.gpu_den are log_prob_ctc and log_prob_den, not the mean of the loss.
I guess the gradient is -(grad_den - (1 + lamb) * grad_ctc). Please correct me if I am wrong.
I think the gradient is grad_den - (1 + lamb) * grad_ctc.
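The algebra discussed above can be checked numerically. The following is a minimal sketch in plain Python, not the CAT or warp-ctc code: the function names are hypothetical, and it assumes that grad_ctc and grad_den are the gradients of log_prob_ctc and log_prob_den with respect to the same activations. If a kernel instead returns gradients of the negative log probabilities, the sign flips, which may be the source of the disagreement about the sign.

```python
def crf_loss_expanded(log_prob_ctc, log_prob_den, lamb=0.1):
    """Loss as first written: CRF term plus a lamb-weighted CTC term."""
    return -(log_prob_ctc - log_prob_den) + lamb * (-log_prob_ctc)


def crf_loss_simplified(log_prob_ctc, log_prob_den, lamb=0.1):
    """Same loss after collecting the log_prob_ctc terms."""
    return log_prob_den - (1.0 + lamb) * log_prob_ctc


def combine_grads(grad_ctc, grad_den, lamb=0.1):
    """Gradient of the simplified loss.

    Assumption: grad_ctc and grad_den are element-wise gradients of
    log_prob_ctc and log_prob_den (not of their negatives).
    """
    return [d - (1.0 + lamb) * c for c, d in zip(grad_ctc, grad_den)]


if __name__ == "__main__":
    # The two forms of the loss agree for arbitrary example values.
    print(crf_loss_expanded(-12.5, -3.0), crf_loss_simplified(-12.5, -3.0))
    print(combine_grads([1.0, 2.0], [0.5, 0.5]))
```

Because the loss is linear in the two log probabilities, the gradient is the same linear combination of their gradients; only the sign convention of the kernel outputs needs to be confirmed.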
For the AISHELL task, I am trying to replace real_loss = (partial_loss - weight) with the phoneme error rate (computed with CTC greedy search and edit distance) as the per-epoch evaluation metric. Is this a problem?
Maybe you can give it a try. The learning rate decay procedure is controlled by the evaluation loss (in train.py we call it the cv loss), so the final results may differ.
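For reference, the phoneme error rate mentioned above (CTC greedy search followed by edit distance) can be sketched as below. This is an illustrative plain-Python version with hypothetical function names, not code from CAT's train.py; it assumes label 0 is the blank and that each frame is a list of per-symbol scores.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Pick the best symbol per frame, collapse repeats, then drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded, prev = [], None
    for label in best:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = row
    return prev_row[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference length."""
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("references contain no labels")
    return sum(edit_distance(r, h) for r, h in zip(refs, hyps)) / total
```

Note that this rate is normalized by the reference length only, so a hypothesis with many insertions can score above 100%.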
I have run the egs/aishell run script, and the experimental results are all correct. But all my tasks are in TensorFlow, so I implemented a TensorFlow API for CAT.
However, I find that the gpu_ctc part is correct while the gpu_den result is wrong, and the gpu_den result is different every time. So I suspect that the TensorFlow API implementation has an error in memory management or GPU calculation.
I know some of the code in GPU_DEN was written by you, so you are very familiar with this part of the implementation. Can you give me some advice? The repo is here: CAT-Tensorflow
Thanks for your attention. I am not familiar with the TensorFlow binding and I am a little busy recently. You can check whether costs_alpha equals costs_beta. I may take a look this weekend if you still have not solved it.
I understand. Thank you so much for your suggestion.
So costs_beta is for verifying the correctness of the algorithm? I deleted costs_beta because I thought it was useless.
costs_beta should be equal to costs_alpha in theory, so it is actually useless if everything is all right. costs_beta can be used to check the correctness of the denominator computation. If the difference between costs_beta and costs_alpha is large, there might be something wrong with the denominator computation.
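To illustrate why the two costs should agree, here is a toy forward-backward sketch in plain Python. It is not the GPU_DEN kernel: it assumes a small dense graph with log-domain start weights, transition weights, and per-frame emission scores, and treats every state as final. All names are hypothetical. The forward (alpha) and backward (beta) recursions sum over the same set of paths, so their totals match up to floating-point error.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_cost(start, trans, emit):
    """Log of the summed path scores, computed left to right (alpha)."""
    n = len(start)
    alpha = [start[s] + emit[0][s] for s in range(n)]
    for t in range(1, len(emit)):
        alpha = [logsumexp([alpha[i] + trans[i][j] for i in range(n)]) + emit[t][j]
                 for j in range(n)]
    return logsumexp(alpha)


def backward_cost(start, trans, emit):
    """Same quantity, computed right to left (beta)."""
    n = len(start)
    beta = [0.0] * n  # assumption: every state is final with log weight 0
    for t in range(len(emit) - 2, -1, -1):
        beta = [logsumexp([trans[i][j] + emit[t + 1][j] + beta[j] for j in range(n)])
                for i in range(n)]
    return logsumexp([start[s] + emit[0][s] + beta[s] for s in range(n)])


if __name__ == "__main__":
    start = [0.0, -1.0]
    trans = [[-0.5, -1.2], [-0.3, -2.0]]
    emit = [[-1.0, -0.2], [-0.7, -1.5], [-0.1, -2.3]]
    print(forward_cost(start, trans, emit), backward_cost(start, trans, emit))
```

A large gap between the two values in a real implementation would point to a bug in one of the recursions or in how memory is shared between them.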
In my task, I split a mini-batch equally across the GPUs and then compute in parallel. For example, with batch_size=32 and GPU setting [0,1,2,3], each GPU gets 8 samples. Is this incompatible with GPU_DEN?
I am not familiar with tensorflow's multi-GPU training strategy. GPU_DEN can do multi-GPU calculation using pytorch DataParallel (although the implementation is a little ugly). Are the results right if you use a single GPU?
Thanks for your reply. Does GPU_DEN have a CPU version? I think it is hard in TensorFlow to specify that a tensor's memory is allocated on the GPU, so the result is not as expected.
GPU_DEN only has a gpu version. Calculating the gradient of the denominator graph on CPU is very slow.
If costs_alpha is not equal to costs_beta, could it be caused by asynchronous GPU calls?
Maybe. But I think in the code I have synchronized the GPU threads when necessary.
Thank you for your help.
The earlier problem was that my test logits had fewer than 218 dimensions, whereas the AISHELL phoneme set has 217 + 1 (blank) symbols. The TensorFlow binding results are now consistent with those of the PyTorch binding.
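A small guard of the following kind could catch this mismatch early. It is a hypothetical helper, not part of CAT or the TensorFlow binding, and it assumes the alphabet axis is the last dimension of the logits shape, as in the (t, n, a) layout used above.

```python
def check_alphabet_size(logits_shape, num_phonemes, num_blank=1):
    """Raise if the last logits dimension does not cover the full alphabet.

    logits_shape: shape tuple such as (t, n, a).
    num_phonemes: size of the phoneme set without the blank (217 for AISHELL,
        according to this thread).
    """
    expected = num_phonemes + num_blank
    actual = logits_shape[-1]
    if actual != expected:
        raise ValueError(
            "logits have %d output symbols, expected %d (%d phonemes + %d blank)"
            % (actual, expected, num_phonemes, num_blank))
    return actual
```

For example, check_alphabet_size((100, 8, 218), 217) passes, while a test tensor with a smaller last dimension raises before the denominator computation is reached.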