Question about JoinAP

Opened this issue 2 years ago · 3 comments

kalvinchang commented 2 years ago

Hi. I apologize if this is not the right place to post a question, but I wasn't sure where to post it.

I have a few questions about JoinAP from Zhu et al. (2021) (https://arxiv.org/pdf/2107.05038.pdf).

1. Where does the model obtain the phones in Figure 2? Are the phones obtained from the ground truth transcriptions or are they first predicted by the acoustic model?
2. By top-down, are you referring to breaking phones down into articulatory phonetic features using panphon?
3. During test time, are the phonetic transcriptions generated by Phonetisaurus also fed into the acoustic model as phone sequences? If not, where do the phones come from?

Thank you in advance!!

Answer 1 · 2022-06-04T12:43:43.000Z

In Figure 2(a), "phone" denotes a hypothesized phone. At frame t, for each phone i, we use Eq. (2) or (3) to calculate the logits, which are then used to calculate the phone (posterior) probabilities for CTC-CRF or CTC. This is explained in the paragraph under Eq. (7).
> By top-down, are you referring to breaking phones down into articulatory phonetic features using panphon?

Partly, yes. "Top-down" means breaking phones into articulatory features and then calculating the phone embeddings from those features.

At test time, we use WFST-based decoding over the graph composed from T, L, and G. L is the WFST that maps a word to a phone sequence. T and G represent the CTC topology and the n-gram language model, respectively.

Answer 2 · 2022-06-06T05:57:52.000Z

I see. Thank you!

For posterity, I will explain my understanding:

When you're calculating the logits for the phones, you cannot know a priori what the actual phone at time t is. Instead, you calculate a logit for each possible phone (and feed the logits into a softmax) to get a probability distribution. The logit for phone i at time t is the dot product of:

e_i: the phone embedding for the ith phone, obtained by a linear or nonlinear transformation of its articulatory feature embedding (called the phonological embedding in the paper)
h_t: the acoustic embedding learned by the acoustic DNN

Figure 2 in the paper shows what happens for the ith phone in the phone inventory, but the model repeats the computation in Figure 2 for every phone in the inventory.
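To make the description above concrete, here is a minimal NumPy sketch of the per-frame computation. It is not the authors' code: the toy sizes, the random feature vectors, the weight names `A`, `A1`, `A2`, and the choice of a sigmoid as the nonlinearity are all assumptions for illustration. The CTC blank symbol is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, not values from the paper).
num_phones, num_feats, hidden_dim, bottleneck = 5, 8, 4, 16

# One articulatory feature vector per phone in the inventory (one row each).
# Real vectors would come from a tool such as panphon; these are random.
P = rng.integers(0, 2, size=(num_phones, num_feats)).astype(float)

# Hypothetical learned parameters.
A = rng.normal(size=(num_feats, hidden_dim))    # linear variant
A1 = rng.normal(size=(num_feats, bottleneck))   # nonlinear variant, layer 1
A2 = rng.normal(size=(bottleneck, hidden_dim))  # nonlinear variant, layer 2


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def phone_embeddings_linear(P, A):
    """e_i for every phone i: a linear map of its feature vector."""
    return P @ A


def phone_embeddings_nonlinear(P, A1, A2):
    """e_i for every phone i: a two-layer map (sigmoid is an assumption)."""
    return sigmoid(P @ A1) @ A2


def phone_posteriors(h_t, E):
    """Softmax over the logits e_i . h_t, one logit per phone."""
    logits = E @ h_t
    logits = logits - logits.max()  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


h_t = rng.normal(size=hidden_dim)  # stand-in for the acoustic DNN output at frame t
probs_linear = phone_posteriors(h_t, phone_embeddings_linear(P, A))
probs_nonlinear = phone_posteriors(h_t, phone_embeddings_nonlinear(P, A1, A2))
```

Note that the matrix product `E @ h_t` scores every phone in the inventory at once, which is the "repeat Figure 2 for each phone" step.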

It is only during decoding that you pick a phone for time t using the posterior phone probability distribution.

Answer 3 · 2022-07-07T16:37:07.000Z

> It is only during decoding that you pick a phone for time t using the posterior phone probability distribution.

Yes, if you use best-path decoding.
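For readers unfamiliar with the term, best-path (greedy) CTC decoding can be sketched as follows. This is a generic illustration, not code from this repository: the function name `best_path_decode` and the convention that index 0 is the blank symbol are assumptions. The WFST-based decoding described earlier in the thread works differently, since it searches over the composed T, L, and G graph rather than picking one symbol per frame.

```python
def best_path_decode(posteriors, blank=0):
    """Greedy CTC decoding: take the argmax per frame, merge repeats, drop blanks.

    posteriors: list of per-frame probability lists (frames x symbols).
    blank: index of the CTC blank symbol (0 here is an assumption).
    """
    # Most probable symbol at each frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in posteriors]

    decoded = []
    prev = None
    for symbol in best:
        # Keep a symbol only when it differs from the previous frame
        # and is not the blank.
        if symbol != prev and symbol != blank:
            decoded.append(symbol)
        prev = symbol
    return decoded


# Toy posteriors over {blank, phone 1, phone 2} for seven frames.
frames = [
    [0.80, 0.10, 0.10],
    [0.10, 0.70, 0.20],
    [0.20, 0.60, 0.20],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.20, 0.10, 0.70],
]
result = best_path_decode(frames)  # [1, 1, 2]
```

The blank between the two runs of phone 1 is what lets the same phone appear twice in a row in the output.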