mechanicalsea/lighthubert

Question about the two-stage training

pyf98 opened this issue · 3 comments

pyf98 commented

Hi,

Thanks for your great work! I have some questions about the two-stage training. I'd appreciate it if you could share more details.

  1. In Stage 2 - Once-for-All Training, which model is used as the teacher? Is it the original HuBERT base, or the distilled model from Stage 1?
  2. How is the small supernet initialized in Stage 2? I guess it is also initialized with the distilled model from Stage 1, but their sizes are different?
  3. In the ablation study (Table 5), how is the supernet initialized in Stage 2 when Stage 1 is skipped? Is it initialized with the original HuBERT base or is it trained from scratch?

Thank you for your time!

Hi, @pyf98
Thanks for your attention. We're glad to answer the questions you raised.

  1. In Stage 2 - Once-for-All Training, which model is used as the teacher? Is it the original HuBERT base, or the distilled model from Stage 1?
    Answer: The original HuBERT base serves as the teacher model in Stage 2 - Once-for-All Training, the same as in Stage 1.
  2. How is the small supernet initialized in Stage 2? I guess it is also initialized with the distilled model from Stage 1, but their sizes are different?
    Answer: Yes, the small supernet is initialized with the distilled model from Stage 1. The small supernet is nested in the distilled model, so its weights are a subset of the distilled model's weights. How a small network is nested in a larger one is illustrated in other once-for-all networks, e.g., Figures 2, 3, and 4 in EfficientTDNN. See the sketch after this list for an illustration of this initialization.
  3. In the ablation study (Table 5), how is the supernet initialized in Stage 2 when Stage 1 is skipped? Is it initialized with the original HuBERT base or is it trained from scratch?
    Answer: The supernet is trained from scratch (i.e., initialized with random weights) in Stage 2 when Stage 1 is skipped. This case is performed to study how Stage 1 contributes.
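
The nesting in answer 2 can be pictured as slicing the leading dimensions of each weight matrix of the distilled Stage 1 model. Below is a minimal sketch, not the actual LightHuBERT code: it uses plain PyTorch linear layers and hypothetical hidden/FFN sizes rather than the real module names or dimensions.

```python
# Minimal sketch (not the actual LightHuBERT implementation): initialize a
# nested "small" linear layer by copying the leading rows/columns of the
# distilled Stage-1 weights, as in weight-sharing once-for-all training.
import torch
import torch.nn as nn

def init_nested_linear(small: nn.Linear, large: nn.Linear) -> None:
    """Copy the top-left slice of `large`'s weights (and bias) into `small`."""
    out_dim, in_dim = small.weight.shape
    with torch.no_grad():
        small.weight.copy_(large.weight[:out_dim, :in_dim])
        if small.bias is not None and large.bias is not None:
            small.bias.copy_(large.bias[:out_dim])

# Example with hypothetical dimensions: a distilled FFN projection of
# 768 -> 3072 shares its leading 512 -> 2048 slice with the small supernet.
large_ffn = nn.Linear(768, 3072)   # stand-in for a Stage-1 (distilled) layer
small_ffn = nn.Linear(512, 2048)   # stand-in for the small supernet's layer
init_nested_linear(small_ffn, large_ffn)
```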

Have a nice day.

pyf98 commented

Thank you for the reply.

How many GPU hours are needed for the two training stages? Just an estimate would be fine.

Congratulations again on this nice piece of work!

Hi @pyf98
The details of GPU hours are summarized as follows.

  1. Stage 1 costs ~62 hours on 32 V100 GPUs for 400k updates.
  2. Stage 2 with the small supernet costs ~19 hours on 8 V100 GPUs for 200k updates, while Stage 2 with the base supernet costs ~18 hours.

Note that the max tokens we used is 1,900k in both Stage 1 and Stage 2, the same as that of data2vec.
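
For intuition, here is a back-of-the-envelope calculation of what that batch size corresponds to in audio. This is a sketch under the assumption (not stated in this thread) that max tokens counts 16 kHz waveform samples per GPU per update, as in fairseq's wav2vec/HuBERT recipes.

```python
# Rough estimate (assumption: max_tokens = 16 kHz waveform samples per GPU).
SAMPLE_RATE = 16_000        # Hz
MAX_TOKENS = 1_900_000      # samples per GPU per update (from the reply above)
NUM_GPUS = 32               # Stage 1 setup reported above

seconds_per_gpu = MAX_TOKENS / SAMPLE_RATE          # ~118.75 s of audio per GPU
seconds_per_update = seconds_per_gpu * NUM_GPUS     # ~3800 s across 32 GPUs
print(f"{seconds_per_gpu:.1f} s/GPU, {seconds_per_update / 60:.1f} min per update")
```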