naver/sqlova

train_shallow_layer.py doesn't train correctly

hanrelan opened this issue · 4 comments

I'm trying to train the shallow-layer model and after 4-5 epochs I'm still seeing acc_lx close to zero. Is that normal? If you have an example training run log and the associated losses, that would be great. I want to make sure that something isn't broken before letting it train for a couple days.

The loss barely changes between epochs, so I think training isn't happening, but I haven't modified the source code other than the paths.

I'm including my training configuration and output log below. I have an 11GB GPU, so I had to change the batch size and gradient accumulation to prevent out-of-memory errors.

python train_shallow_layer.py --seed 1 --bS 8 --accumulate_gradients 4 --bert_type_abb uS --fine_tune --lr 0.001 --lr_bert 0.00001 --max_seq_leng 222
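As a side note on the flags above: with --bS 8 and --accumulate_gradients 4, the effective batch size is 8 × 4 = 32, which matches the Batch_size line the script prints. The sketch below is a toy illustration in plain Python of why accumulating gradients over micro-batches can stand in for one larger batch. The model, the data, and the function names are all hypothetical, and it assumes each micro-batch gradient is divided by the number of accumulation steps; how train_shallow_layer.py normalizes its loss is not shown in this thread.

```python
# Toy check: accumulating gradients over micro-batches equals one large batch.
# Model: y_hat = w * x with mean squared error. All names are illustrative only.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size, accumulate_steps):
    """Sum per-micro-batch gradients, each scaled by 1/accumulate_steps."""
    total = 0.0
    for step in range(accumulate_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        # Assumption: scaling by 1/accumulate_steps keeps the same magnitude
        # as a single gradient computed over the full batch.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulate_steps
    return total


micro_batch_size = 8   # corresponds to --bS 8
accumulate_steps = 4   # corresponds to --accumulate_gradients 4
effective_batch_size = micro_batch_size * accumulate_steps

# Synthetic data for the toy model (not related to WikiSQL).
xs = [float(i % 7 + 1) for i in range(effective_batch_size)]
ys = [3.0 * x + 0.5 for x in xs]
w = 1.0

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size, accumulate_steps)
print(effective_batch_size, abs(full - accum) < 1e-9)
```

The two gradients agree up to floating-point rounding, so the smaller per-step batch should not by itself explain a run that fails to learn.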

BERT-type: uncased_L-12_H-768_A-12
Batch_size = 32
BERT parameters:
learning rate: 1e-05
Fine-tune BERT: True
vocab size: 30522
hidden_size: 768
num_hidden_layer: 12
num_attention_heads: 12
hidden_act: gelu
intermediate_size: 3072
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 512
type_vocab_size: 2
initializer_range: 0.02
Load pre-trained parameters.
Seq-to-SQL: the number of final BERT layers to be used: 1
Seq-to-SQL: the size of hidden dimension = 100
Seq-to-SQL: LSTM encoding layer size = 2
Seq-to-SQL: dropout rate = 0.3
Seq-to-SQL: learning rate = 0.001


train results ------------
 Epoch: 0, ave loss: 6.216941231456922, acc_sc: 0.163, acc_sa: 0.717, acc_wn: 0.590,         acc_wc: 0.092, acc_wo: 0.547, acc_wvi: 0.016, acc_wv: 0.016, acc_lx: 0.000, acc_x: 0.001
dev results ------------
 Epoch: 0, ave loss: 6.288717828444157, acc_sc: 0.174, acc_sa: 0.715, acc_wn: 0.683,         acc_wc: 0.143, acc_wo: 0.658, acc_wvi: 0.016, acc_wv: 0.027, acc_lx: 0.000, acc_x: 0.001
 Best Dev lx acc: 0.00023750148438427741 at epoch: 0
train results ------------
 Epoch: 1, ave loss: 6.191903309470515, acc_sc: 0.166, acc_sa: 0.720, acc_wn: 0.692,         acc_wc: 0.113, acc_wo: 0.668, acc_wvi: 0.028, acc_wv: 0.028, acc_lx: 0.000, acc_x: 0.001
dev results ------------
 Epoch: 1, ave loss: 6.2836473057494, acc_sc: 0.168, acc_sa: 0.715, acc_wn: 0.683,         acc_wc: 0.148, acc_wo: 0.658, acc_wvi: 0.007, acc_wv: 0.014, acc_lx: 0.000, acc_x: 0.000
 Best Dev lx acc: 0.00023750148438427741 at epoch: 0
train results ------------
 Epoch: 2, ave loss: 6.187300067725954, acc_sc: 0.167, acc_sa: 0.720, acc_wn: 0.693,         acc_wc: 0.113, acc_wo: 0.669, acc_wvi: 0.033, acc_wv: 0.033, acc_lx: 0.000, acc_x: 0.001
dev results ------------
 Epoch: 2, ave loss: 6.283452489599453, acc_sc: 0.169, acc_sa: 0.715, acc_wn: 0.683,         acc_wc: 0.152, acc_wo: 0.658, acc_wvi: 0.000, acc_wv: 0.000, acc_lx: 0.000, acc_x: 0.001
 Best Dev lx acc: 0.00023750148438427741 at epoch: 0
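
To check a log like this for a stalled run without reading it by eye, something along these lines might help. It is a minimal sketch using only the standard library: it assumes the "train results" / "dev results" headers and the "Epoch: ..., ave loss: ..." line format shown above, the sample lines are abbreviated from this log, and the function names and the 1% threshold are hypothetical choices, not part of sqlova.

```python
import re

# Matches "name: number" pairs such as "ave loss: 6.21" or "acc_lx: 0.000".
METRIC_RE = re.compile(r"(Epoch|ave loss|acc_\w+): (\d+(?:\.\d+)?)")


def parse_metrics(line):
    """Turn one 'Epoch: ...' log line into a dict of floats."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


def parse_log(log_text):
    """Group per-epoch metric dicts under 'train' and 'dev'."""
    results = {"train": [], "dev": []}
    split = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("train results"):
            split = "train"
        elif line.startswith("dev results"):
            split = "dev"
        elif line.startswith("Epoch:") and split is not None:
            results[split].append(parse_metrics(line))
    return results


def is_flat(values, min_rel_change=0.01):
    """True if the last value is within min_rel_change of the first."""
    first, last = values[0], values[-1]
    return abs(first - last) / abs(first) < min_rel_change


# Lines abbreviated from the log above (epochs 0 and 2 only).
SAMPLE_LOG = """\
train results ------------
 Epoch: 0, ave loss: 6.216941231456922, acc_sc: 0.163, acc_lx: 0.000, acc_x: 0.001
dev results ------------
 Epoch: 0, ave loss: 6.288717828444157, acc_sc: 0.174, acc_lx: 0.000, acc_x: 0.001
train results ------------
 Epoch: 2, ave loss: 6.187300067725954, acc_sc: 0.167, acc_lx: 0.000, acc_x: 0.001
dev results ------------
 Epoch: 2, ave loss: 6.283452489599453, acc_sc: 0.169, acc_lx: 0.000, acc_x: 0.001
"""

parsed = parse_log(SAMPLE_LOG)
train_flat = is_flat([m["ave loss"] for m in parsed["train"]])
dev_flat = is_flat([m["ave loss"] for m in parsed["dev"]])
print(train_flat, dev_flat)
```

On these lines the train loss moves by less than 1% between epoch 0 and epoch 2, which is consistent with the impression that training is not making progress.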

I've confirmed that there is an issue with train_shallow_layer.py. Specifically, I ran both train.py (nl2sql) and train_shallow_layer.py overnight with the same CLI parameters. The nl2sql model reached 80% accuracy after 12 epochs, while the shallow-layer model remained at 0% even after 21 epochs of training.

Any ideas on what might be causing train_shallow_layer.py to fail?

Hi @hanrelan

That may be caused by an improper learning rate. In the shallow-layer script, args.lr is used to train BERT (sorry for the confusion), and 1e-3 is too high for that. I have modified the code to avoid the confusion. Please try again with the same command (note that args.lr is no longer used in train_shallow_layer.py as of 0e1794f).

Thanks!

Wonseok

Ah, makes sense, thanks. I'll try it again tonight and close this issue if it works.

That fixed it, getting 78% lx accuracy now. Thanks!