Abstract
Fine-tuning of self-supervised models is a powerful transfer learning method in a variety of fields, including speech processing, since it can utilize generic feature representations obtained from large amounts of unlabeled data. Fine-tuning, however, requires a new parameter set for each downstream task, which is parameter-inefficient. Adapter architectures have been proposed to partially solve this issue by inserting lightweight learnable modules into a frozen pre-trained model. However, existing adapter architectures fail to adaptively leverage the low- to high-level features stored in different layers, which is necessary for solving various kinds of speech processing tasks. Thus, we propose a new adapter architecture to acquire feature representations more flexibly for various speech tasks. In experiments, we applied this adapter to WavLM on four speech tasks. It performed on par with or better than naïve fine-tuning with only 11% of learnable parameters, and it also outperformed an existing adapter architecture.
Adapter Architecture
The proposed adapter architecture incorporates two types of adapters, namely Layer adapters (L-adapters) and Encoder adapters (E-adapters), into a frozen backbone. The L-adapters bridge each intermediate layer and the top layer, as shown in Figure 1a. They help the model quickly adapt speech representations to various downstream tasks and also reduce the dependency on the initialization of adapter parameters. The E-adapters are inserted into each encoder layer in a similar way to previous work (https://arxiv.org/pdf/2202.03218.pdf), as shown in Figure 1b. In contrast to the previous work, our architecture does not have adapters after the multi-head self-attention (MHSA) modules, and instead has L-adapters. We use wavlm-base-plus as the model backbone.
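For orientation, here is a minimal PyTorch sketch of the two adapter types. It is not the repository's implementation: the bottleneck width, GELU activations, LayerNorm placement, and the softmax-weighted sum over layers are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class EAdapter(nn.Module):
    """Residual bottleneck adapter inserted into each frozen encoder layer
    (after the feed-forward module; no adapter follows the MHSA module).
    The bottleneck width and GELU activation are illustrative assumptions."""
    def __init__(self, dim=768, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


class LAdapters(nn.Module):
    """One small adapter per encoder-layer output, bridging every intermediate
    layer to the top; the outputs are merged by a learnable weighted sum and
    passed to the downstream head. Internals are illustrative assumptions."""
    def __init__(self, num_layers=12, dim=768):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.LayerNorm(dim))
            for _ in range(num_layers)
        )
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of per-layer outputs, each (batch, time, dim)
        adapted = torch.stack([a(h) for a, h in zip(self.adapters, hidden_states)])
        w = torch.softmax(self.layer_weights, dim=0)
        return (w[:, None, None, None] * adapted).sum(dim=0)
```

In the full model, one EAdapter would sit inside each of the 12 frozen encoder layers, while LAdapters consumes the list of all intermediate layer outputs before the downstream head.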
Installation and Running experiments
Install the packages necessary for running the experiments with the following command.
pip install -r requirement.txt
The following command provides an example of training with the proposed method. Select task_name from ASV, ER, ASR, or IC.
# ./run.sh task_name
./run.sh ASR
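For reference, the sketch below shows how the frozen backbone could be loaded and its per-layer hidden states extracted, assuming the Hugging Face checkpoint microsoft/wavlm-base-plus; this is only an illustration, and the actual training pipeline is driven by run.sh above.

```python
import torch
from transformers import WavLMModel

# Load the backbone and freeze it; only adapters and the downstream head train.
backbone = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# One second of dummy 16 kHz audio, shape (batch, samples).
waveform = torch.zeros(1, 16000)
with torch.no_grad():
    out = backbone(waveform, output_hidden_states=True)

# hidden_states contains the feature-projection output plus all 12 encoder
# layers; the L-adapters attach to these intermediate representations.
print(len(out.hidden_states), out.hidden_states[-1].shape)
```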
Experiment and Results
We demonstrate the effectiveness of the proposed method on four downstream tasks: automatic speaker verification (ASV), emotion recognition (ER), automatic speech recognition (ASR), and intent classification (IC). We conduct experiments to compare the performance of five training methods on these four tasks. All experiments in this work were conducted with four 16 GB-memory GPUs.
We run experiments on the following five training methods: fine-tuning, the conventional adapter method, the proposed method, L-adapters-only, and E-adapters-only.
The performance comparison is shown in the figure and the table below. The table reports the error rates at the right ends of the curves in the figure.
| Method | # Params | ASV | ER | ASR | IC |
|---|---|---|---|---|---|
| Fine-tuning | 85.1 M | | | | |
| Conventional method | 9.53 M | | | | |
| Proposed method | 9.13 M | | | | |
| L-adapters-only | 4.74 M | | | | |
| E-adapters-only | 4.79 M | | | | |
Optimal learning rates
We used a scheduler that warms up to the maximum learning rate and then decays for ASV, for ASR (all methods except the conventional method), and for IC. For the other settings, we used a scheduler that decays the learning rate from its initial value every fixed number of steps. We chose the best maximum (or initial) learning rate from {1e-3, 5e-4, 1e-4, 5e-5, 1e-5} for every module except the downstream head. The selected rates are listed in the table below; an illustrative optimizer/scheduler sketch follows the table.
| Method | Module | ASV | ER | ASR | IC |
|---|---|---|---|---|---|
| Fine-tuning | Downstream head | 5e-4 | 5e-4 | 1e-2 | 5e-4 |
| Fine-tuning | Encoder | 1e-4 | 5e-5 | 1e-4 | 1e-4 |
| Conventional method | Downstream head | 5e-4 | 5e-4 | 1e-3 | 5e-4 |
| Conventional method | Adapters | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| Conventional method | Layernorm layer | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| Proposed method | Downstream head | 5e-4 | 5e-4 | 2e-3 | 5e-4 |
| Proposed method | L-adapters | 1e-4 | 1e-4 | 1e-3 | 1e-5 |
| Proposed method | E-adapters | 1e-5 | 5e-5 | 1e-3 | 1e-5 |
| Proposed method | Layernorm layer | 1e-5 | 5e-5 | 1e-3 | 1e-5 |
| L-adapters-only | Downstream head | 5e-4 | 5e-4 | 2e-3 | 5e-4 |
| L-adapters-only | L-adapters | 5e-4 | 5e-4 | 1e-3 | 1e-4 |
| L-adapters-only | Layernorm layer | 5e-4 | 5e-4 | 1e-3 | 1e-4 |
| E-adapters-only | Downstream head | 5e-4 | 5e-4 | 2e-3 | 5e-4 |
| E-adapters-only | E-adapters | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| E-adapters-only | Layernorm layer | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
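As a concrete illustration of the per-module learning rates above, the sketch below builds optimizer parameter groups and a linear warmup-then-decay schedule, using the proposed-method ASV column as an example. The stand-in modules, AdamW, the warmup/decay shape, and the step counts are assumptions for illustration, not the repository's exact configuration.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in trainable modules (illustrative only); in practice these would be
# the downstream head, the L-/E-adapters, and the backbone's LayerNorm weights.
downstream_head = nn.Linear(768, 10)
l_adapters = nn.ModuleList(nn.Linear(768, 768) for _ in range(12))
e_adapters = nn.ModuleList(nn.Linear(768, 768) for _ in range(12))
layernorms = nn.ModuleList(nn.LayerNorm(768) for _ in range(12))

# Per-module learning rates, mirroring the proposed-method ASV column above.
optimizer = AdamW([
    {"params": downstream_head.parameters(), "lr": 5e-4},
    {"params": l_adapters.parameters(), "lr": 1e-4},
    {"params": e_adapters.parameters(), "lr": 1e-5},
    {"params": layernorms.parameters(), "lr": 1e-5},
])

# Warm up linearly to each group's maximum lr, then decay linearly to zero.
warmup_steps, total_steps = 1000, 20000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# For the settings trained with step decay instead, one could use e.g.
# torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.5).
```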
Citation
Please use the following citation for this work:
@inproceedings{otake2023parameter,
title = {Parameter Efficient Transfer Learning for Various Speech Processing Tasks},
  author = {S. Otake and R. Kawakami and N. Inoue},
booktitle = {Proc. ICASSP},
year = {2023},
}
Note
The paper was uploaded to arXiv on 6 Dec 2022.
The paper was accepted for ICASSP 2023.