Words in char list
hollarob opened this issue · 5 comments
Hi, thanks for this contribution!
I'm working on phoneme-level alignment for a fine-tuned Wav2Vec2 model.
The char list in my case consists of tokens of two ASCII characters each. I have seen that in the `prepare_text` function, for every occurrence in the text (in my case, for example: `['ab', 'ca', ...]`), each character is checked against the char list. But the char list consists of these character pairs, so it is not appended to the ground truth.
Am I missing something here? Do I have to adjust the code to my needs? Maybe an additional attribute in the config would help.
Okay, I figured out that I probably should use the function `prepare_tokenized_text` with a list of strings of space-separated tokens, like this: `['ab ca bg']`.
But now I don't know how to calculate the utterance timings in ms, as I only get 2 utterance begin indices for an arbitrarily long list of tokens, even though I have more than 2 timings. `determine_utterance_segments` then only returns one segment, which is very off. Any help is appreciated!
I am not yet sure about your exact application, but keeping a properly translated list `['ab', 'ca', ...]` with `prepare_tokenized_text`, i.e. interpreting each phoneme as a separate utterance, should already give you an estimation of the starting and ending times of these phonemes.
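For illustration, a minimal sketch of this per-phoneme approach (assuming `probs` is your CTC log-probability matrix, `char_list` and `blank_id` come from your model vocabulary, and `audio_array` is 16 kHz mono audio; these names are placeholders):

```python
import ctc_segmentation

phonemes = ["ab", "ca", "bg"]  # each token is treated as its own utterance

config = ctc_segmentation.CtcSegmentationParameters(char_list=char_list, blank=blank_id)
config.index_duration = audio_array.shape[0] / probs.shape[0] / 16000  # seconds per CTC frame

ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_tokenized_text(config, phonemes)
timings, char_probs, _ = ctc_segmentation.ctc_segmentation(config, probs, ground_truth_mat)
segments = ctc_segmentation.determine_utterance_segments(
    config, utt_begin_indices, char_probs, timings, phonemes
)
for phoneme, (start, end, conf) in zip(phonemes, segments):
    print(f"{phoneme}: {start:.2f}s - {end:.2f}s (log-space confidence {conf:.2f})")
```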
On accuracy: please note that CTC is based on a monotonic alignment assumption; however, due to varying distributions of the phonetic information within a word, this may even change the ordering of phonemes in practical applications. While the CTC segmentation algorithm was not designed specifically for such alignments, in my experience the estimated alignments are mostly accurate.
If your tokens were correctly translated into a long single-utterance token list `['ab ca bg']` and then aligned, you can read the individual token timings from the `timings` array, which is an output of the `ctc_segmentation` function.
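A sketch of this single-utterance variant, reusing the assumed `config` and `probs` from the snippet above. Note that `timings` has one entry per row of `ground_truth_mat`, which also contains leading/trailing blank or space rows, so the mapping below should be verified against your library version:

```python
utterance = ["ab ca bg"]  # one long space-separated token string

ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_tokenized_text(config, utterance)
timings, char_probs, _ = ctc_segmentation.ctc_segmentation(config, probs, ground_truth_mat)

# timings is in seconds, one entry per ground-truth row; printing the decoded
# rows next to their timings shows which timing belongs to which token.
for row, t in zip(ground_truth_mat, timings):
    label = config.char_list[row[0]] if row[0] >= 0 else "<pad>"
    print(f"{label}: {t:.2f}s")
```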
I appreciate your help! So this is what I'm currently doing:
```python
import numpy as np
import ctc_segmentation


def align_tokens(
    vocab: dict,
    blank_id: int,
    audio_array: np.ndarray,
    decoding_dict: dict,
    probs: np.ndarray,
    pred_list: list[str],
) -> list[dict]:
    # pred_list = ['[SIL]', 'be', 'au', 'ac', 'aw', 'az', 'by', 'bw', 'ap', 'be', 'bl', 'bc', 'bb', 'ce', 'bb', 'ay', 'cl', '[SIL]', 'bf', 'bl', 'bt', 'ah', '[SIL]']
    char_list = [x for x in vocab.keys()]
    config = ctc_segmentation.CtcSegmentationParameters(
        char_list=char_list,
        blank=blank_id,
    )
    config.update_excluded_characters()
    # seconds of audio per CTC output frame
    config.index_duration = (
        audio_array.shape[0] / probs.shape[0] / CONFIG["model"]["samplerate"]
    )
    ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_tokenized_text(
        config=config,
        text=pred_list,
    )
    timings, char_probs, _ = ctc_segmentation.ctc_segmentation(
        config=config,
        lpz=probs,
        ground_truth=ground_truth_mat,
    )
    segments = ctc_segmentation.determine_utterance_segments(
        config=config,
        utt_begin_indices=utt_begin_indices,
        char_probs=char_probs,
        timings=timings,
        text=pred_list,
    )
    segments = [
        {"text": decoding_dict[w], "start": p[0], "end": p[1], "conf": p[2]}
        for w, p in zip(pred_list, segments)
    ]
    return segments
```
With that I get the tokens aligned, though the alignments are very wrong, and the confidence score is negative for every token. The model I fine-tuned has quite a high accuracy, and the tokens are predicted by the model itself.
By inspection, this should be OK. The confidence score is in log space, so it should be negative. I would directly use `timings` to obtain the token timings, that is, if token duration is not important. I recommend checking the model output, especially at the first and last token. Also, I recommend plotting your CTC output `probs`. This helps to visualize activations in shorter audio files and to check whether the alignment result is correct.
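For example, a quick sketch for inspecting the CTC output (assuming `probs` is the (frames, vocabulary) log-probability matrix and `char_list`/`blank_id` are the same placeholders as in the snippet above):

```python
import itertools
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 4))
# Transpose so that time runs along the x-axis and the vocabulary along the y-axis.
ax.imshow(probs.T, aspect="auto", origin="lower", interpolation="none")
ax.set_xlabel("CTC frame index")
ax.set_ylabel("vocabulary index")
ax.set_title("CTC log-probabilities")
plt.show()

# Sanity check: greedy CTC decoding (collapse repeats, drop blanks) should
# roughly reproduce the predicted token sequence.
greedy = np.argmax(probs, axis=-1)
decoded = [char_list[i] for i, _ in itertools.groupby(greedy) if i != blank_id]
print(decoded)
```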
Thanks for your help and time! I followed your advice and plotted the CTC `probs`. The CTC output is weirdly shifted and compressed and does not align with the input audio. I'll have to look into that more; I haven't found out why yet. Once I figure it out, I will confirm whether it works.