byeonghu-na/MATRN

Question about pretraining on language model

Closed this issue · 5 comments

Hi, thank you for your nice work.
When I try to pretrain the language model, I run into the following problem:
[screenshot of the error]
Here is my YAML for the pretrain-language config; I only changed the epoch-related values.

global:
  name: my-pretrain-language
  phase: train
  stage: pretrain-language
  workdir: results
  seed: ~

dataset:
  train: {
    roots: ['data/WikiText-103.csv'],
    batch_size: 1024
  }
  test: {
    roots: ['data/WikiText-103_eval_d1.csv'],
    batch_size: 1024
  }
  valid: {
    roots: [ 'data/validation' ],
    batch_size: 384
  }

training:
  epochs: 80
  show_iters: 50
  eval_iters: 100
  save_iters: 3000

optimizer:
  type: Adam
  true_wd: False
  wd: 0.0
  bn_wd: False
  clip_grad: 20
  lr: 0.0001
  args: {
    betas: !!python/tuple [0.9, 0.999], # for default Adam
  }
  scheduler: {
    periods: [70, 10],
    gamma: 0.1,
  }

model:
  name: 'modules.model_language.BCNLanguage'
  language: {
    num_layers: 4,
    loss_weight: 1.,
    use_self_attn: False
  }
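For context, my reading of the scheduler settings (my own assumption, not something stated in the repo) is that periods: [70, 10] with gamma: 0.1 keeps the learning rate at 1e-4 for the first 70 epochs and drops it to 1e-5 for the last 10, so the periods should sum to epochs. Roughly equivalent, in plain PyTorch, to:

import torch

# A minimal sketch of how I read the scheduler config (assumption, not the repo's code):
# lr = 1e-4 for the first 70 epochs, then multiplied by gamma = 0.1 for the last 10.
model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70], gamma=0.1)

for epoch in range(80):
    # ... one training epoch ...
    scheduler.step()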

May I ask if you have encountered any similar issue?
Thank you!

I also tried the default pretrain_language_model.yaml, but got the same error.

Hi, I ran with both your YAML file and the default YAML file, and both work.
My run script is:

python main.py --config=configs/pretrain_language_model.yaml

Could you check it again and provide more information about your run environment?

Hi, thank you for your answer. I ran with the default YAML file and only changed the batch_size and eval_iters.
The same error occurred right after the first evaluation (at eval_iters).
[screenshot of the error]
Here is the default YAML file:

global:
  name: pretrain-language-model
  phase: train
  stage: pretrain-language
  workdir: results
  seed: ~
 
dataset:
  train: {
    roots: ['data/WikiText-103.csv'],
    batch_size: 1024
  }
  test: {
    roots: ['data/WikiText-103_eval_d1.csv'],
    batch_size: 1024
  }
  valid: {
    roots: [ 'data/validation' ],
    batch_size: 384
  }

training:
  epochs: 80
  show_iters: 50
  eval_iters: 100
  save_iters: 3000

optimizer:
  type: Adam
  true_wd: False
  wd: 0.0
  bn_wd: False
  clip_grad: 20
  lr: 0.0001
  args: {
    betas: !!python/tuple [0.9, 0.999], # for default Adam 
  }
  scheduler: {
    periods: [70, 10],
    gamma: 0.1,
  }

model:
  name: 'modules.model_language.BCNLanguage'
  language: {
    num_layers: 4,
    loss_weight: 1.,
    use_self_attn: False
  }

I am confused about the error; maybe it comes from the wrong version of some package.
I ran it on a single 2080 Ti, and the primary package versions are as follows:

torch=1.7.1
torchvision=0.8.2
Pillow=8.3.2
opencv-python=4.6.0.66
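
I grabbed these with a quick check like this (nothing repo-specific, just the standard version attributes):

import torch, torchvision, PIL, cv2
print('torch        ', torch.__version__)        # 1.7.1
print('torchvision  ', torchvision.__version__)  # 0.8.2
print('Pillow       ', PIL.__version__)          # 8.3.2
print('opencv-python', cv2.__version__)          # 4.6.0.66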

Would you mind sharing your environment so I can compare mine against it?
Sorry for the trouble, and thanks!

Oh, I see. The problem is caused by the validation dataset.
We have now fixed it (efd29dd) so that we only evaluate on the test dataset (not the validation dataset) when pretraining the language model, and the code is now working!
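
Roughly, the idea of the change is the following (an illustrative sketch with made-up names, not the actual diff in efd29dd):

# Illustrative sketch only: when the stage is pretrain-language, build the
# evaluation loaders from the test roots and skip the image-based validation set.
def build_eval_roots(config):
    roots = list(config['dataset']['test']['roots'])
    if config['global']['stage'] != 'pretrain-language':
        roots += list(config['dataset']['valid']['roots'])
    return roots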

Thank you for letting me know about the error.

It works after updating the code! I will close this issue.
Thanks for your work, have a nice day!