Lightning-Universe/lightning-transformers

Can't use a logger

Closed this issue · 4 comments

๐Ÿ› Bug

The script crashes when I try to log the run.

To Reproduce

Run:

python train.py task=nlp/text_classification  dataset=nlp/text_classification/emotion trainer.gpus=1 trainer.accelerator=dp log=true trainer.logger=tensorboard

And then I get:

> python train.py     task=nlp/text_classification  dataset=nlp/text_classification/emotion trainer.gpus=1 trainer.accelerator=dp log=true trainer.logger=tensorboard

  num_workers: 16
trainer:
  _target_: pytorch_lightning.Trainer
  logger: tensorboard
  checkpoint_callback: true
  callbacks: null
  default_root_dir: null
  gradient_clip_val: 0.0
  process_position: 0
  num_nodes: 1
  num_processes: 1
  gpus: 1
  auto_select_gpus: false
  tpu_cores: null
  log_gpu_memory: null
  progress_bar_refresh_rate: 1
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: 1
  max_steps: null
  min_steps: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  val_check_interval: 1.0
  flush_logs_every_n_steps: 100
  log_every_n_steps: 50
  accelerator: dp
  sync_batchnorm: false
  precision: 32
  weights_summary: top
  weights_save_path: null
  num_sanity_val_steps: 2
  truncated_bptt_steps: null
  resume_from_checkpoint: null
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_epoch: false
  auto_lr_find: false
  replace_sampler_ddp: true
  terminate_on_nan: false
  auto_scale_batch_size: false
  prepare_data_per_node: true
  plugins: null
  amp_backend: native
  amp_level: O2
  move_metrics_to_cpu: false
experiment_name: ${now:%Y-%m-%d}_${now:%H-%M-%S}
log: true
ignore_warnings: true

Error executing job with overrides: ['task=nlp/text_classification', 'dataset=nlp/text_classification/emotion', 'trainer.gpus=1', 'trainer.accelerator=dp', 'log=true', 'trainer.logger=tensorboard']
Top level config has to be OmegaConf DictConfig, plain dict, or a Structured Config class or instance

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Expected behavior

Be able to run the script and log the training process!

Environment

  • PyTorch Version (e.g., 1.0): 11.2
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source): -
  • Python version: 3.8.10
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: A100
  • Any other relevant information:

Additional context

Please help :)

Just wanted to add that I have the exact same issue, and it is a critical one. This looks like a great framework, but the fact that I can't see how training is progressing in the form of a readable log (e.g. TensorBoard, wandb, etc.) makes it unusable. Even if the documentation could just walk users through this (assuming it isn't a bug), that would be helpful.

Apologies for the late response here!

Logging was added under conf/trainer/logger and can be enabled like so:

python train.py task=nlp/text_classification  dataset=nlp/text_classification/emotion trainer.gpus=1 trainer.accelerator=dp log=true +trainer/logger=tensorboard

The key part is the +trainer/logger=tensorboard override: the leading + composes the logger config group from conf/trainer/logger/tensorboard.yaml into the trainer config. By contrast, trainer.logger=tensorboard (as in the original command) sets trainer.logger to the literal string "tensorboard" (visible as logger: tensorboard in the printed config above), which then fails to instantiate as a logger config; hence the DictConfig error. You can modify the save dir by appending trainer.logger.save_dir=my_directory/. See the conf directory for more loggers: https://github.com/PyTorchLightning/lightning-transformers/tree/master/conf/trainer/logger
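For reference, a minimal sketch of what such a logger config group file might contain (the exact contents of the file in the repo may differ; _target_ pointing at a Lightning logger class is the Hydra instantiation convention used throughout this config, and pytorch_lightning.loggers.TensorBoardLogger is the standard Lightning class):

# conf/trainer/logger/tensorboard.yaml (sketch; the real file may differ)
_target_: pytorch_lightning.loggers.TensorBoardLogger
save_dir: logs/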

I'll make a PR to update the documentation and add this information under a new tab, which should close this issue.
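For anyone wiring this up manually outside of Hydra, what the composed config amounts to is roughly the following, using the standard PyTorch Lightning 1.x API (the save_dir value is just an example):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# Roughly what Hydra instantiates from +trainer/logger=tensorboard
# with trainer.logger.save_dir=my_directory/ appended
logger = TensorBoardLogger(save_dir="my_directory/")
trainer = Trainer(gpus=1, accelerator="dp", logger=logger)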

Great, thanks!