mechanicalsea/lighthubert

Question about the two-stage training

pyf98 opened this issue · 3 comments

pyf98 commented

Hi,

Thanks for your great work! I have some questions about the two-stage training. I'd appreciate it if you could share more details.

  1. In Stage 2 - Once-for-All Training, which model is used as the teacher? Is it the original HuBERT base, or the distilled model from Stage 1?
  2. How is the small supernet initialized in Stage 2? I guess it is also initialized with the distilled model from Stage 1, but their sizes are different?
  3. In the ablation study (Table 5), how is the supernet initialized in Stage 2 when Stage 1 is skipped? Is it initialized with the original HuBERT base or is it trained from scratch?

Thank you for your time!

Hi, @pyf98
Thanks for your attention. We're glad to answer the questions you raised.

  1. In Stage 2 - Once-for-All Training, which model is used as the teacher? Is it the original HuBERT base, or the distilled model from Stage 1?
    Answer: The original HuBERT base serves as the teacher model in Stage 2 - Once-for-All Training, the same as in Stage 1.
  2. How is the small supernet initialized in Stage 2? I guess it is also initialized with the distilled model from Stage 1, but their sizes are different?
    Answer: Yes, the small supernet is initialized with the distilled model from Stage 1. The small supernet is nested in the distilled model, so its weights are a subset of the distilled model's weights. How a small network is nested in a larger one is illustrated in other once-for-all networks, e.g., Figures 2, 3, and 4 in EfficientTDNN. See the sketch after this list for an illustration of this initialization.
  3. In the ablation study (Table 5), how is the supernet initialized in Stage 2 when Stage 1 is skipped? Is it initialized with the original HuBERT base or is it trained from scratch?
    Answer: The supernet is trained from scratch (i.e., initialized with random weights) in Stage 2 when Stage 1 is skipped. This case is performed to study how Stage 1 contributes.
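
The nesting in answer 2 can be pictured as slicing the leading dimensions of each weight matrix of the distilled Stage 1 model. Below is a minimal sketch, not the actual LightHuBERT code: it uses plain PyTorch linear layers and hypothetical hidden/FFN sizes rather than the real module names or dimensions.

```python
# Minimal sketch (not the actual LightHuBERT implementation): initialize a
# nested "small" linear layer by copying the leading rows/columns of the
# distilled Stage-1 weights, as in weight-sharing once-for-all training.
import torch
import torch.nn as nn

def init_nested_linear(small: nn.Linear, large: nn.Linear) -> None:
    """Copy the top-left slice of `large`'s weights (and bias) into `small`."""
    out_dim, in_dim = small.weight.shape
    with torch.no_grad():
        small.weight.copy_(large.weight[:out_dim, :in_dim])
        if small.bias is not None and large.bias is not None:
            small.bias.copy_(large.bias[:out_dim])

# Example with hypothetical dimensions: a distilled FFN projection of
# 768 -> 3072 shares its leading 512 -> 2048 slice with the small supernet.
large_ffn = nn.Linear(768, 3072)   # stand-in for a Stage-1 (distilled) layer
small_ffn = nn.Linear(512, 2048)   # stand-in for the small supernet's layer
init_nested_linear(small_ffn, large_ffn)
```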

Have a nice day.

pyf98 commented

Thank you for the reply.

How many GPU hours are needed for the two training stages? Just an estimate would be fine.

Congratulations again on this nice piece of work!

Hi @pyf98
The details of GPU hours are summarized as follows.

  1. Stage 1 costs ~62 hours on 32 V100 GPUs for 400k updates.
  2. Stage 2 with the small supernet costs ~19 hours on 8 V100 GPUs for 200k updates, while Stage 2 with the base supernet costs ~18 hours.

Note that the max tokens we used is 1,900k in both Stage 1 and Stage 2, the same as that of data2vec.
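
For intuition, here is a back-of-the-envelope calculation of what that batch size corresponds to in audio. This is a sketch under the assumption (not stated in this thread) that max tokens counts 16 kHz waveform samples per GPU per update, as in fairseq's wav2vec/HuBERT recipes.

```python
# Rough estimate (assumption: max_tokens = 16 kHz waveform samples per GPU).
SAMPLE_RATE = 16_000        # Hz
MAX_TOKENS = 1_900_000      # samples per GPU per update (from the reply above)
NUM_GPUS = 32               # Stage 1 setup reported above

seconds_per_gpu = MAX_TOKENS / SAMPLE_RATE          # ~118.75 s of audio per GPU
seconds_per_update = seconds_per_gpu * NUM_GPUS     # ~3800 s across 32 GPUs
print(f"{seconds_per_gpu:.1f} s/GPU, {seconds_per_update / 60:.1f} min per update")
```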