training language model on custom data - Missing key block_size
Closed this issue · 2 comments
enpassanty commented
I'm trying to train an MLM on custom data. The sequences in the CSV are long; when training with the Hugging Face run_mlm.py script I truncate at 512 tokens. How do I access the max_length arg here? Why am I hitting a block_size key error, and is block_size required for custom data?

Here is the command I'm running:
```
! python train.py \
  task=nlp/language_modeling \
  dataset.cfg.train_file="/content/gdrive/MyDrive/nlp-chart/train charts.csv" \
  dataset.cfg.validation_file="/content/gdrive/MyDrive/nlp-chart/test charts.csv" \
  backbone.pretrained_model_name_or_path=roberta-base \
  training.batch_size=8
```
Full output and traceback:

```
2021-04-24 13:40:50.319301: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
dataset:
  _target_: lightning_transformers.task.nlp.language_modeling.LanguageModelingDataModule
  cfg:
    batch_size: ${training.batch_size}
    num_workers: ${training.num_workers}
    dataset_name: null
    dataset_config_name: null
    train_file: /content/gdrive/MyDrive/nlp-chart/train charts.csv
    validation_file: /content/gdrive/MyDrive/nlp-chart/test charts.csv
    test_file: null
    train_val_split: null
    max_samples: null
    cache_dir: null
    padding: max_length
    truncation: only_first
    preprocessing_num_workers: 1
    load_from_cache_file: true
    max_length: 128
    limit_train_samples: null
    limit_val_samples: null
    limit_test_samples: null
task:
  _recursive_: false
  _target_: lightning_transformers.task.nlp.language_modeling.LanguageModelingTransformer
  optimizer: ${optimizer}
  scheduler: ${scheduler}
  backbone: ${backbone}
  downstream_model_type: transformers.AutoModelForCausalLM
tokenizer:
  _target_: transformers.AutoTokenizer.from_pretrained
  pretrained_model_name_or_path: ${backbone.pretrained_model_name_or_path}
  use_fast: true
backbone:
  pretrained_model_name_or_path: roberta-base
optimizer:
  _target_: torch.optim.AdamW
  lr: ${training.lr}
  weight_decay: 0.001
scheduler:
  _target_: transformers.get_linear_schedule_with_warmup
  num_training_steps: -1
  num_warmup_steps: 0.1
training:
  run_test_after_fit: true
  lr: 5.0e-05
  output_dir: .
  batch_size: 8
  num_workers: 16
trainer:
  _target_: pytorch_lightning.Trainer
  logger: true
  checkpoint_callback: true
  callbacks: null
  default_root_dir: null
  gradient_clip_val: 0.0
  process_position: 0
  num_nodes: 1
  num_processes: 1
  gpus: null
  auto_select_gpus: false
  tpu_cores: null
  log_gpu_memory: null
  progress_bar_refresh_rate: 1
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: 1
  max_steps: null
  min_steps: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  val_check_interval: 1.0
  flush_logs_every_n_steps: 100
  log_every_n_steps: 50
  accelerator: null
  sync_batchnorm: false
  precision: 32
  weights_summary: top
  weights_save_path: null
  num_sanity_val_steps: 2
  truncated_bptt_steps: null
  resume_from_checkpoint: null
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_epoch: false
  auto_lr_find: false
  replace_sampler_ddp: true
  terminate_on_nan: false
  auto_scale_batch_size: false
  prepare_data_per_node: true
  plugins: null
  amp_backend: native
  amp_level: O2
  move_metrics_to_cpu: false
experiment_name: ${now:%Y-%m-%d}_${now:%H-%M-%S}
log: false
ignore_warnings: true
[2021-04-24 13:40:53,946][datasets.builder][WARNING] - Using custom data configuration default-a4347468916cb6de
[2021-04-24 13:40:53,948][datasets.builder][WARNING] - Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-a4347468916cb6de/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
0% 0/10 [00:00<?, ?ba/s]Token indices sequence length is longer than the specified maximum sequence length for this model (2655 > 512). Running this sequence through the model will result in indexing errors
100% 10/10 [00:33<00:00, 3.34s/ba]
100% 1/1 [00:01<00:00, 1.78s/ba]
Traceback (most recent call last):
  File "train.py", line 88, in <module>
    hydra_entry()
  File "/usr/local/lib/python3.7/dist-packages/hydra/main.py", line 33, in decorated_main
    config_name=config_name,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 370, in _run_hydra
    lambda: hydra.run(
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 373, in <lambda>
    overrides=args.overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 98, in run
    configure_logging=with_log_configuration,
  File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 129, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 84, in hydra_entry
    main(cfg)
  File "train.py", line 78, in main
    logger=logger,
  File "train.py", line 53, in run
    data_module.setup("fit")
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/core/nlp/data.py", line 33, in setup
    dataset = self.process_data(dataset, stage=stage)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/task/nlp/language_modeling/data.py", line 54, in process_data
    convert_to_features = partial(self.convert_to_features, block_size=self.effective_block_size)
  File "/usr/local/lib/python3.7/dist-packages/lightning_transformers/task/nlp/language_modeling/data.py", line 67, in effective_block_size
    if self.cfg.block_size is None:
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 352, in __getattr__
    key=key, value=None, cause=e, type_override=ConfigAttributeError
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/base.py", line 195, in _format_and_raise
    type_override=type_override,
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py", line 701, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py", line 599, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 349, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 416, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py", line 448, in _get_node
    raise ConfigKeyError(f"Missing key {key}")
omegaconf.errors.ConfigAttributeError: Missing key block_size
    full_key: block_size
    object_type=dict
```
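For context, the error comes from OmegaConf: `effective_block_size` in `language_modeling/data.py` reads `self.cfg.block_size`, and the dataset config printed above has no `block_size` entry. A minimal sketch of that behaviour outside lightning-transformers (the config values below are just placeholders, not the library's defaults):

```python
from omegaconf import OmegaConf
from omegaconf.errors import ConfigAttributeError

# A config resembling dataset.cfg above, but with no block_size entry.
cfg = OmegaConf.create({"max_length": 128, "padding": "max_length"})

try:
    _ = cfg.block_size  # same attribute access as self.cfg.block_size in data.py
except ConfigAttributeError as err:
    print(err)  # -> Missing key block_size
```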
SeanNaren commented
Could you try the branch in #160? With that branch you can set the block size from the command line, like `dataset.cfg.block_size=512`!
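For reference, the original command with that override added would look something like this (a sketch assuming the #160 branch is installed; paths and other overrides are unchanged from the report):

```
! python train.py \
  task=nlp/language_modeling \
  dataset.cfg.train_file="/content/gdrive/MyDrive/nlp-chart/train charts.csv" \
  dataset.cfg.validation_file="/content/gdrive/MyDrive/nlp-chart/test charts.csv" \
  dataset.cfg.block_size=512 \
  backbone.pretrained_model_name_or_path=roberta-base \
  training.batch_size=8
```

A block size of 512 matches roberta-base's maximum sequence length.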
enpassanty commented
This solved the problem for me. Thanks!