thu-spmi/CAT

I have implemented a TensorFlow binding, but the gradient may be wrong.

Closed this issue · 16 comments

I have modified the warp-ctc TensorFlow binding and the CAT PyTorch binding; in addition, I have removed costs_beta (which may be useless).

# Imports and helpers: _ctc_crf is the loaded custom-op library; _BroadcastMul
# and _get_dim are assumed to come from the binding's helper utilities (as in
# the warp-ctc TensorFlow binding).
import tensorflow as tf
from tensorflow.python.framework import ops


def ctc_crf_loss(logits, labels, input_lengths,
                 blank_label=0, lamb=0.1):
  '''Computes the CTC-CRF loss between a sequence of logits and a
  ground truth labeling.

  Args:
      logits: A 3-D Tensor of floats. The dimensions
                   should be (t, n, a), where t is the time index, n
                   is the minibatch index, and a indexes over
                   logits for each symbol in the alphabet.

      labels: An int32 SparseTensor. labels.indices[i, :] == [b, t] means 
              labels.values[i] stores the id for (batch b, time t). 
              labels.values[i] must take on values in [0, num_labels).

      input_lengths: A 1-D Tensor of ints, the number of time steps
                     for each sequence in the minibatch.

      blank_label: int, the label value/index that the CTC
                   calculation should use as the blank label.

      lamb: float, the weight for the auxiliary CTC loss,
                  combined with the CRF loss to help convergence.

  Returns:
      1-D float Tensor, the cost of each example in the minibatch
      (as negative log probabilities).

  * This function performs the softmax operation internally.

  * The label reserved for the blank symbol should be label 0.

  '''
  # The input of the warp-ctc is modified to be the log-softmax output of the bottom neural network.
  activations = tf.nn.log_softmax(logits) # (t, n, a)
  activations_ = tf.transpose(activations, (1, 0, 2)) # (n, t, a)
  loss, _, _, costs_alpha = _ctc_crf.ctc_crf_loss(
      activations, activations_, labels.indices, labels.values,
      input_lengths, blank_label, lamb) # costs, gradients, grad_net, costs_alpha

  return (costs_alpha - (1 + lamb) * loss)  # (n,)


@ops.RegisterGradient("CtcCrfLoss")
def _CTCLossGrad(op, grad_loss, a, b, c):
  """The derivative provided by CTC-CRF Loss.

  Args:
     op: the CtcCrfLoss op.
     grad_loss: The backprop for cost.
     a, b, c: unused backprops for the gradients, grad_net and costs_alpha outputs.

  Returns:
     The CTC-CRF Loss gradient.
  """
  lamb = op.get_attr('lamb')
  grad_ctc = op.outputs[1] # (t, n, a)
  grad_den = tf.transpose(op.outputs[2], (1, 0, 2)) # (t, n, a)
  grad = grad_den - (1 + lamb) * grad_ctc # (t, n, a)
  # Average over the batch size (dimension n of (t, n, a)).
  grad /= tf.cast(_get_dim(grad, 1), dtype=tf.float32) # (t, n, a)

  # Return gradient for inputs and None for
  # activations_, labels_indices, labels_values and sequence_length.
  return [_BroadcastMul(grad_loss, grad), None, None, None, None]
  # return [_BroadcastMul(grad_loss, op.outputs[1]), None, None, None, None]
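
For context, here is a minimal usage sketch of the wrapper above (illustrative only: the shapes, the dense-to-sparse label conversion, and the optimizer are my own assumptions, not part of the binding):

import tensorflow as tf

t, n, a = 100, 8, 218                      # time steps, batch size, alphabet size
logits = tf.placeholder(tf.float32, shape=(t, n, a))
input_lengths = tf.placeholder(tf.int32, shape=(n,))
dense_labels = tf.placeholder(tf.int32, shape=(n, None))  # 0 is reserved for blank/padding

# ctc_crf_loss expects an int32 SparseTensor, so drop the zero padding.
idx = tf.where(tf.not_equal(dense_labels, 0))
labels = tf.SparseTensor(idx,
                         tf.gather_nd(dense_labels, idx),
                         tf.shape(dense_labels, out_type=tf.int64))

costs = ctc_crf_loss(logits, labels, input_lengths, blank_label=0, lamb=0.1)  # (n,)
loss = tf.reduce_mean(costs)
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)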

I can provide all the code if necessary, but my result is wrong: the TER is over 100%.

In my opinion:
CRF_Loss = -(log_prob_ctc - log_prob_den) + lamb * (-log_prob_ctc) = log_prob_den - (1 + lamb) * log_prob_ctc, so the outputs of ctc_crf_base.gpu_ctc and ctc_crf_base.gpu_den are log_prob_ctc and log_prob_den respectively, not the mean of the loss.
I guess the gradient is -(grad_den - (1 + lamb) * grad_ctc). Please correct me if wrong.
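
One way to settle the sign question empirically is a quick finite-difference check against tf.gradients. The sketch below is illustrative only: the shapes, dummy labels and random inputs are made up, and it simply compares the analytic slope of one logit with a numeric slope.

import numpy as np
import tensorflow as tf

t, n, a = 20, 2, 218
logits_np = np.random.randn(t, n, a).astype(np.float32)
lengths_np = np.full((n,), t, dtype=np.int32)

logits = tf.placeholder(tf.float32, shape=(t, n, a))
lengths = tf.placeholder(tf.int32, shape=(n,))
# Two short dummy label sequences (nonzero ids, since 0 is the blank).
labels = tf.SparseTensor(indices=[[0, 0], [0, 1], [1, 0]],
                         values=tf.constant([5, 7, 3], dtype=tf.int32),
                         dense_shape=[2, 2])

costs = ctc_crf_loss(logits, labels, lengths, blank_label=0, lamb=0.1)
total = tf.reduce_sum(costs)
analytic = tf.gradients(total, logits)[0]

eps = 1e-2
with tf.Session() as sess:
    g = sess.run(analytic, {logits: logits_np, lengths: lengths_np})[0, 0, 5]
    base = sess.run(total, {logits: logits_np, lengths: lengths_np})
    logits_np[0, 0, 5] += eps
    bumped = sess.run(total, {logits: logits_np, lengths: lengths_np})
    # If the registered gradient has the wrong sign, these two slopes disagree.
    print('analytic: %.4f  numeric: %.4f' % (g, (bumped - base) / eps))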

hyx16 commented

In my opinion:
CRF_Loss = -(log_prob_ctc - log_prob_den) + lamb * (-log_prob_ctc) = log_prob_den - (1 + lamb) * log_prob_ctc, so the outputs of ctc_crf_base.gpu_ctc and ctc_crf_base.gpu_den are log_prob_ctc and log_prob_den respectively, not the mean of the loss.
I guess the gradient is -(grad_den - (1 + lamb) * grad_ctc). Please correct me if wrong.

I think the gradient is grad_den - (1 + lamb) * grad_ctc.

For the AISHELL task, I try to replace real_loss = (partial_loss - weight) with the Phoneme Error Rate (computed with CTC greedy search and edit distance) each epoch for evaluation. Is this a problem?

aky15 commented

For the AISHELL task, I try to replace real_loss = (partial_loss - weight) with the Phoneme Error Rate (computed with CTC greedy search and edit distance) each epoch for evaluation. Is this a problem?

Maybe you can have a try. The learning rate decay procedure is controlled by the evaluation loss (in train.py we call it cv loss), so the final results may differ.
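
To illustrate why the choice of cv metric matters (this is not the actual train.py rule, just a generic halve-on-no-improvement sketch), whatever quantity is reported as the cv score drives when the learning rate drops:

def maybe_decay(lr, cv_history, factor=0.5, tol=0.0):
    # Halve the learning rate whenever the latest cv score fails to improve.
    if len(cv_history) >= 2 and cv_history[-1] > cv_history[-2] - tol:
        return lr * factor
    return lr

Feeding PER instead of the cv loss changes this sequence of decisions, which is why the final results may differ.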

In my opinion:
CRF_Loss = -(log_prob_ctc - log_prob_den) + lamb * (-log_prob_ctc) = log_prob_den - (1 + lamb) * log_prob_ctc, so the outputs of ctc_crf_base.gpu_ctc and ctc_crf_base.gpu_den are log_prob_ctc and log_prob_den respectively, not the mean of the loss.
I guess the gradient is -(grad_den - (1 + lamb) * grad_ctc). Please correct me if wrong.

I think the gradient is grad_den - (1 + lamb) * grad_ctc.

I have run the egs/aishell run script and the experimental results are all correct. But all my work is in TensorFlow, so I implemented a TensorFlow API for CAT.
However, I find that the gpu_ctc part is correct while the gpu_den result is wrong, and the gpu_den result is different every time. So I suspect there is an error in the TensorFlow API memory management or in the GPU calculation.
I know some of the code in GPU_DEN was written by you, so this part of the implementation is very familiar to you. Can you give me some advice? The repo is here: CAT-Tensorflow

hyx16 commented

Thanks for your attention. I am not familiar with the TensorFlow binding and I am a little busy recently. You can check whether costs_alpha equals costs_beta. I may take a look this weekend if you still have not solved it.

Thanks for your attention. I am not familiar with the TensorFlow binding and I am a little busy recently. You can check whether costs_alpha equals costs_beta. I may take a look this weekend if you still have not solved it.

I understand. Thank you so much for your suggestion.

costs_beta

So costs_beta is for verifying the correctness of the algorithm? I deleted costs_beta because I thought it was useless.

hyx16 commented

costs_beta

So costs_beta is for verifying the correctness of the algorithm? I deleted costs_beta because I thought it was useless.

costs_beta should be equal to costs_alpha in theory, so it is actually useless if everything is all right. costs_beta can be used to check the correctness of the denominator computation. If the difference between costs_beta and costs_alpha is large, there might be something wrong with the denominator computation.
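
A concrete way to do that check (an illustrative NumPy sketch; it assumes the op is temporarily modified to also return costs_beta as in the original binding, and that sess and feed come from the surrounding setup):

import numpy as np

alpha_np, beta_np = sess.run([costs_alpha, costs_beta], feed_dict=feed)
diff = np.max(np.abs(alpha_np - beta_np))
print('max |costs_alpha - costs_beta| = %g' % diff)  # should be close to 0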

In my task, I split a mini-batch equally across the GPUs and then do the calculations in parallel. For example, with batch_size=32 and GPU setting [0,1,2,3], each GPU gets 8 samples. Is this incompatible with GPU_DEN?
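
For reference, the split described above would look roughly like the following TF1 tower sketch (illustrative only: it reuses the placeholder names from the earlier sketch, assumes the batch dimension divides evenly across the GPUs, and omits variable reuse and gradient averaging):

import tensorflow as tf

gpus = [0, 1, 2, 3]
logit_shards = tf.split(logits, len(gpus), axis=1)                 # split the batch dim n
length_shards = tf.split(input_lengths, len(gpus))
label_shards = tf.sparse_split(sp_input=labels, num_split=len(gpus), axis=0)

tower_costs = []
for i, gpu in enumerate(gpus):
    with tf.device('/gpu:%d' % gpu):
        tower_costs.append(ctc_crf_loss(logit_shards[i], label_shards[i],
                                        length_shards[i], blank_label=0, lamb=0.1))
loss = tf.reduce_mean(tf.concat(tower_costs, axis=0))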

hyx16 commented

I am not familiar with TensorFlow's multi-GPU training strategy. GPU_DEN can do multi-GPU calculation using PyTorch DataParallel (although the implementation is a little ugly). Are the results right if you use a single GPU?

I am not familiar with TensorFlow's multi-GPU training strategy. GPU_DEN can do multi-GPU calculation using PyTorch DataParallel (although the implementation is a little ugly). Are the results right if you use a single GPU?

Thanks for your reply. Does GPU_DEN have a CPU version? I think it is hard in TensorFlow to specify that a tensor allocates its memory on the GPU, so the result is not what I expected.
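
For what it is worth, placement can be forced in TF1 so a custom op cannot silently fall back to another device (a general TensorFlow sketch, not a verified fix for the gpu_den problem):

import tensorflow as tf

with tf.device('/gpu:0'):
    costs = ctc_crf_loss(logits, labels, input_lengths, blank_label=0, lamb=0.1)

config = tf.ConfigProto(allow_soft_placement=False,  # fail instead of moving ops
                        log_device_placement=True)   # print where each op runs
sess = tf.Session(config=config)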

hyx16 commented

GPU_DEN only has a GPU version. Calculating the gradient of the denominator graph on the CPU is very slow.

costs_beta

So costs_beta is for verifying the correctness of the algorithm? I deleted costs_beta because I thought it was useless.

costs_beta should be equal to costs_alpha in theory, so it is actually useless if everything is all right. costs_beta can be used to check the correctness of the denominator computation. If the difference between costs_beta and costs_alpha is large, there might be something wrong with the denominator computation.

If costs_alpha is not equal to costs_beta, could it be caused by asynchronous GPU calls?

hyx16 commented

If costs_alpha is not equal to costs_beta, could it be caused by asynchronous GPU calls?

Maybe. But I think in the code I have synchronized the GPU threads when necessary.

If costs_alpha is not equal to costs_beta, could it be caused by asynchronous GPU calls?

Maybe. But I think in the code I have synchronized the GPU threads when necessary.

Thank you for your help.
The previous problem was that my test logits dimension was less than 218, while the AISHELL phoneme set is 217+1 (blank). The TensorFlow binding result is now consistent with the PyTorch binding result.
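
A tiny sanity check along these lines would have caught it early (illustrative; the phoneme count is taken from this thread):

num_phonemes = 217                  # AISHELL phoneme set size
alphabet_size = num_phonemes + 1    # +1 for the blank at index 0
assert logits.shape[-1] == alphabet_size, \
    'last logits dim is %s, expected %d' % (logits.shape[-1], alphabet_size)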