
CatastrophicForgetting-BERT

This repository studies the catastrophic forgetting (CF) problem in BERT. It supports continual fine-tuning of BERT on sequence classification tasks (CoLA, MRPC and RTE) and token classification tasks (CoNLL-2003, UD).

We compare the model's predictions on the first task before and after continual fine-tuning on the subsequent tasks.
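The sketch below outlines this protocol conceptually (it is not the repository's actual run.py): the same model is fine-tuned on the tasks one after another, and after each stage it is evaluated on every task to fill one row of the accuracy matrix. The `fine_tune_on` and `evaluate_on` helpers are hypothetical placeholders.

```python
# Conceptual sketch of the continual fine-tuning protocol (not the actual run.py).
# `fine_tune_on` and `evaluate_on` are hypothetical placeholders: the first continues
# training the shared encoder (plus a task-specific head) on one task, the second
# returns the task's dev-set metric (accuracy, MCC or F1) for the current model.
import numpy as np

def continual_fine_tune(model, tasks, fine_tune_on, evaluate_on):
    """Fine-tune `model` on `tasks` in order and return the accuracy matrix.

    acc[i, j] = score on task j after fine-tuning on tasks 0..i.
    """
    n = len(tasks)
    acc = np.zeros((n, n))
    for i, task in enumerate(tasks):
        model = fine_tune_on(model, task)   # start from the previous task's weights
        for j, eval_task in enumerate(tasks):
            acc[i, j] = evaluate_on(model, eval_task)
    return acc
```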

Experiment 1: NER (10,000) -> CoLA -> MRPC

Accuracy Matrix (row: model state after fine-tuning on each successive task; column: evaluation task; the CoLA score appears to be the Matthews correlation coefficient, hence the negative value):

| After fine-tuning on | CoNLL-2003 | CoLA | MRPC |
| --- | --- | --- | --- |
| CoNLL-2003 | 0.9883139 | -0.0207027 | 0.3112745 |
| CoLA | 0.9802894 | 0.5550197 | 0.4338235 |
| MRPC | 0.9614164 | 0.3715264 | 0.8578431 |

Transfer Matrix (entry (i, j) = accuracy(i, j) - accuracy(j, j)):

| After fine-tuning on | CoNLL-2003 | CoLA | MRPC |
| --- | --- | --- | --- |
| CoNLL-2003 | 0. | -0.5757224 | -0.5465686 |
| CoLA | -0.0080245 | 0. | -0.4240196 |
| MRPC | -0.0268975 | -0.1834933 | 0. |
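A minimal NumPy sketch of how the transfer matrix can be derived from the accuracy matrix under this convention:

```python
import numpy as np

# Accuracy matrix of Experiment 1: row i = after fine-tuning on the i-th task,
# columns = CoNLL-2003, CoLA, MRPC.
acc = np.array([
    [0.9883139, -0.0207027, 0.3112745],
    [0.9802894,  0.5550197, 0.4338235],
    [0.9614164,  0.3715264, 0.8578431],
])

# transfer[i, j] = acc[i, j] - acc[j, j]; entries below the diagonal quantify
# forgetting of earlier tasks, entries above it show how far a not-yet-trained
# task is from its just-trained score.
transfer = acc - np.diag(acc)
print(transfer.round(7))
```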

Explanation:

CoNLL-2003's label list: "O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"

Frequency of each label in the dev set: [42746, 922, 346, 1839, 1304, 1341, 751, 1837, 257]

Average sequence length ≈ 15.80; the first 10,000 sequences of the training set are used for fine-tuning.

  • cnt1 : how many times each label is predicted correctly in the first evaluation (right after fine-tuning on CoNLL-2003): [42568, 836, 286, 1812, 1287, 1258, 682, 1781, 233]

  • cnt2 : how many times each label is predicted correctly in the final evaluation (after the subsequent fine-tuning on CoLA and MRPC): [42537, 712, 240, 1176, 1168, 1144, 670, 1516, 199]

  • cnt3 : how many times each label is predicted correctly in the first evaluation but incorrectly in the final one (forgotten): [107, 140, 50, 637, 120, 119, 28, 267, 34]; as a fraction of the label's dev-set frequency: [0.00, 0.15, 0.14, 0.35, 0.09, 0.09, 0.04, 0.15, 0.13]

  • cnt4 : how many times each label is predicted incorrectly in the first evaluation but correctly in the final one (newly learned): [76, 16, 4, 1, 1, 5, 16, 2, 0]
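A sketch of how per-label counts of this kind can be computed, assuming two flat lists of token-level predictions (one taken right after the NER fine-tuning, one from the final model) aligned with the gold labels; the function and variable names are illustrative, not the repository's:

```python
from collections import Counter

LABELS = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

def per_label_forgetting(gold, pred_first, pred_second):
    """Per label: correct in the first evaluation (cnt1), correct in the final one
    (cnt2), forgotten (cnt3: correct first, wrong second) and newly learned
    (cnt4: wrong first, correct second)."""
    cnt1, cnt2, cnt3, cnt4 = Counter(), Counter(), Counter(), Counter()
    for g, p1, p2 in zip(gold, pred_first, pred_second):
        if p1 == g:
            cnt1[g] += 1
        if p2 == g:
            cnt2[g] += 1
        if p1 == g and p2 != g:
            cnt3[g] += 1
        if p1 != g and p2 == g:
            cnt4[g] += 1
    return [[c[label] for label in LABELS] for c in (cnt1, cnt2, cnt3, cnt4)]
```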

Experiment 2: CoLA -> MRPC -> NER

Accuracy Matrix

| After fine-tuning on | CoLA | MRPC | CoNLL-2003 |
| --- | --- | --- | --- |
| CoLA | 0.5699305 | 0.6838235 | 0.2488363 |
| MRPC | 0.5528801 | 0.8455882 | 0.3384493 |
| CoNLL-2003 | 0.3818496 | 0.7647059 | 0.9872816 |

Transfer Matrix

| After fine-tuning on | CoLA | MRPC | CoNLL-2003 |
| --- | --- | --- | --- |
| CoLA | 0. | -0.1617647 | -0.7384453 |
| MRPC | -0.0170504 | 0. | -0.6488323 |
| CoNLL-2003 | -0.1880809 | -0.0808823 | 0. |

CoLA's label list: label 0 (ungrammatical), label 1 (grammatical)

In CoLA's dev set there are 1043 samples, of which 322 have label 0 and 721 have label 1.

  • cnt1 : sequences predicted correctly in the first evaluation (right after fine-tuning on CoLA) => 860

  • cnt2 : sequences predicted correctly in the final evaluation (after the subsequent fine-tuning on MRPC and NER) => 789

  • cnt3 : sequences predicted correctly in the first evaluation but incorrectly in the final one (forgotten) => 118

    • cnt3_0 : how many times label '0' is forgotten => 117, 0.36
    • cnt3_1 : how many times label '1' is forgotten => 1, 0.01
  • cnt4 : sequences predicted incorrectly in the first evaluation but correctly in the final one (newly learned) => 47

    • cnt4_0 : how many times label '0' is newly learned => 0
    • cnt4_1 : how many times label '1' is newly learned => 47, 0.07
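The same comparison can be made at the sequence level, with the forgotten and newly learned counts broken down by gold label; again a small illustrative sketch with hypothetical names, not the repository's code:

```python
import numpy as np

def sequence_forgetting(gold, pred_first, pred_second):
    """For a binary task such as CoLA, return cnt1, cnt2 and the per-label
    breakdown of forgotten (cnt3) and newly learned (cnt4) sequences."""
    gold, p1, p2 = (np.asarray(x) for x in (gold, pred_first, pred_second))
    cnt1 = int((p1 == gold).sum())      # correct in the first evaluation
    cnt2 = int((p2 == gold).sum())      # correct in the final evaluation
    forgotten = (p1 == gold) & (p2 != gold)
    learned = (p1 != gold) & (p2 == gold)

    def by_label(mask):
        return {lab: int((mask & (gold == lab)).sum()) for lab in (0, 1)}

    return cnt1, cnt2, by_label(forgotten), by_label(learned)
```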

Experiment 3: CoLA -> MRPC -> UD (POS)

Accuracy Matrix

| After fine-tuning on | CoLA | MRPC | UD |
| --- | --- | --- | --- |
| CoLA | 0.578641 | 0.6838235 | 0.0393673 |
| MRPC | 0.5548848 | 0.8235294 | 0.0428605 |
| UD | 0.314600 | 0.7328431 | 0.9685219 |

Transfer Matrix

| After fine-tuning on | CoLA | MRPC | UD |
| --- | --- | --- | --- |
| CoLA | 0. | -0.1397059 | -0.9291546 |
| MRPC | -0.0237568 | 0. | -0.9256614 |
| UD | -0.2640415 | -0.0906863 | 0. |

CoLA's label list: label 0 (ungrammatical), label 1 (grammatical)

In CoLA's dev set there are 1043 samples, of which 322 have label 0 and 721 have label 1.

  • cnt1 : sequences predicted correctly in the first evaluation (right after fine-tuning on CoLA) => 864

  • cnt2 : sequences predicted correctly in the final evaluation (after the subsequent fine-tuning on MRPC and UD) => 772

  • cnt3 : sequences predicted correctly in the first evaluation but incorrectly in the final one (forgotten) => 139

    • cnt3_0 : how many times label '0' is forgotten => 119, 0.37
    • cnt3_1 : how many times label '1' is forgotten => 20, 0.03
  • cnt4 : sequences predicted incorrectly in the first evaluation but correctly in the final one (newly learned) => 47

    • cnt4_0 : how many times label '0' is newly learned => 11, 0.03
    • cnt4_1 : how many times label '1' is newly learned => 36, 0.05

Experiment 4: NER -> MNLI

Result Matrix

| After fine-tuning on | CoNLL-2003 (F1) | MNLI (ACC) |
| --- | --- | --- |
| CoNLL-2003 | 0.9473067 | 0.3552725 |
| MNLI | 0.6580307 | 0.8366786 |

Transfer Matrix

| After fine-tuning on | CoNLL-2003 (F1) | MNLI (ACC) |
| --- | --- | --- |
| CoNLL-2003 | 0. | -0.4814061 |
| MNLI | -0.289276 | 0. |

Explanation:

CoNLL-2003's label list: "O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"

Frequency of each label in the dev set: [42746, 922, 346, 1839, 1304, 1341, 751, 1837, 257]

Average sequence length ≈ 15.80; all sequences of the training set are used for fine-tuning.

  • cnt1 : how many times each label is predicted correctly in the first evaluation (right after fine-tuning on CoNLL-2003): [42589, 834, 291, 1812, 1289, 1257, 690, 1786, 236]

  • cnt2 : how many times each label is predicted correctly in the final evaluation (after the subsequent fine-tuning on MNLI): [39261, 743, 273, 1755, 1281, 1139, 586, 1701, 101]

  • cnt3 : how many times each label is predicted correctly in the first evaluation but incorrectly in the final one (forgotten): [3356, 117, 34, 67, 13, 132, 116, 105, 136]; as a fraction of the label's dev-set frequency: [0.0785, 0.1269, 0.0983, 0.0364, 0.01, 0.0984, 0.1545, 0.0572, 0.5292]

  • cnt4 : how many times each label is predicted incorrectly in the first evaluation but correctly in the final one (newly learned): [28, 26, 16, 10, 5, 14, 12, 20, 1]

Packages

  • python 3.6.9
  • PyTorch 1.7.0+cu101
  • Transformers 4.1.0.dev0
  • conllu 4.2.1
  • seqeval 1.2.2
  • tqdm 4.41.1

Code

  • run.py: the main entry point that runs training and evaluation.

  • processors.py: code to process the text data for each task.

  • utils.py: helper functions, e.g. formatting time and converting arguments to a dict.

  • arguments.py: code to parse command-line arguments.

Run the Code

Use run_cl.sh to fine-tune BERT on several tasks sequentially. The task order and the hyperparameter settings for each task are defined in order1.json. The available arguments are described in arguments.py.

TODO