Potential max_input_len Issue/Inconsistency?
williambarberjr opened this issue · 4 comments
Please check that this issue hasn't been reported before.
- I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I expect input/output pairs whose combined token length (after the chat template is applied) is less than 8192 to be retained in full as valid training data when examining the arrow files under /last_run_prepared/ after running python -m axolotl.cli.preprocess instruct-lora-8b.yml
Current behaviour
Training data is being cut off at a max length of 2048.
Steps to reproduce
My yml sets sequence_len: 8192, but the logs keep printing max_input_len as having been set to 2048. Even when I alter the source code in src/axolotl/utils/trainer.py to hard-code max_input_len: 7192, and change def drop_long_seq(sample, sequence_len=2048, min_sequence_len=2) to def drop_long_seq(sample, sequence_len=7192, min_sequence_len=2), and the log printout confirms that max_input_len has been set to 7192 after rebuilding/reinstalling axolotl, the training data still gets cut off at a length of 2048 tokens. The issue occurs consistently when using these datasets/chat_template settings:
chat_template: llama3
datasets:
  - path: williambarberjr/L3_8B_Instruct_MarkdownToSummaryConvert
    type: chat_template
    chat_template: llama3
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant
      system:
        - system
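For context, the drop_long_seq function mentioned above can be sketched roughly as follows. This is an assumption based on the signature quoted in this report, not the actual axolotl implementation: a filter of this shape would keep only samples whose tokenized length fits the configured bounds, which is consistent with the 2048-token cut-off described.

```python
def drop_long_seq(sample, sequence_len=2048, min_sequence_len=2):
    # Hedged sketch: keep a sample only if its tokenized length falls
    # within [min_sequence_len, sequence_len]. Body is an assumption;
    # only the signature comes from the report above.
    return min_sequence_len <= len(sample["input_ids"]) <= sequence_len

print(drop_long_seq({"input_ids": [0] * 4096}))  # False with the 2048 default
```

With a default of 2048 baked in at the call site, a sample of 4096 tokens would be dropped (or truncated upstream) regardless of sequence_len: 8192 in the yml.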
However, when I use my own custom setup like this:

datasets:
  - path: ./cleanAxolotlTrainingDataL3_8B_Instruct.jsonl
    type: input_output

with the jsonl data already prepped with all the correct chat-template tokens (beginning, ending, etc.), the data is not cut off at a max of 2048. Here's how I'm printing out the prepared data to check whether the template looks correct. First I run python -m axolotl.cli.preprocess instruct-lora-8b.yml on the command line, then I run this Python code:
import json, yaml
import os
from transformers import AutoTokenizer
from datasets import load_from_disk

# find the yml file in the current directory
ymlFile = next(f for f in os.listdir('.') if f.endswith('.yml'))
with open(ymlFile, 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)

# get the subdirectory under last_run_prepared
subdirToLoad = next(subdir for subdir in os.listdir('last_run_prepared') if os.path.isdir(os.path.join('last_run_prepared', subdir)))
ds = load_from_disk(f'last_run_prepared/{subdirToLoad}')
with open('chatTemplateExample.txt', 'w') as f:
    f.write(tok.decode(ds['input_ids'][0]))
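A quick way to confirm where the cut-off lands is to check the longest tokenized example in the prepared dataset. The helper name below is mine; it would operate on the ds['input_ids'] column loaded by the script above (a toy list stands in here):

```python
def longest_example(input_ids_column):
    """Length of the longest tokenized example in the column."""
    return max(len(ids) for ids in input_ids_column)

# Toy stand-in for ds['input_ids']; with the bug described above,
# the real column tops out at 2048 despite sequence_len: 8192.
toy_column = [[1] * 2048, [1] * 731, [1] * 1500]
print(longest_example(toy_column))  # 2048
```

If every example caps out at exactly 2048, truncation (rather than unusually short source data) is the likely cause.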
Thoughts on what might be causing this?
Config yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
chat_template: llama3
datasets:
  - path: williambarberjr/axolotlTrainingDataL3_8B_Instruct.jsonl
    type: chat_template
    chat_template: llama3
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant
      system:
        - system
dataset_prepared_path:
val_set_size: 0.1
output_dir: ./lora-out
data_seed: 49
seed: 49
sequence_len: 8192
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
- embed_tokens
- lm_head
wandb_project: markdownToSummaryLoraAllExamples
wandb_entity: williambarberjr
wandb_watch: gradients
wandb_name: instruct_lora_L3_8B_adamw_all_ex
wandb_log_model: checkpoint
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_8bit
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: .00000001
lr_scheduler: constant
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:
auto_resume_from_checkpoints: true
warmup_steps: 10
eval_batch_size: 2
eval_max_new_tokens: 128
save_total_limit: 3
save_steps: 100
eval_steps: 100
early_stopping_patience: 3
weight_decay: 0.0
special_tokens:
  bos_token: <|begin_of_text|>
  eos_token: <|eot_id|>
  pad_token: <|eot_id|>
Possible solution
I tried several of the ideas above, including hard-coding some variables, to no avail. For whatever reason, the custom input_output setup described above doesn't reproduce the cut-off issue.
Which Operating Systems are you using?
- Linux
- macOS
- Windows
Python Version
Python 3.10.14
axolotl branch-commit
main
Acknowledgements
- My issue title is concise, descriptive, and in title casing.
- I have searched the existing issues to make sure this bug has not been reported yet.
- I am using the latest version of axolotl.
- I have provided enough information for the maintainers to reproduce and diagnose the issue.
We had the same issue: it appears that max_length is somehow hardcoded and does not pick up the value set in the yml file. Changing the value resolved the issue.
@williambarberjr you could probably pass max_length: 8192 in the yml file:
datasets:
  - path: williambarberjr/L3_8B_Instruct_MarkdownToSummaryConvert
    type: chat_template
    chat_template: llama3
    max_length: 8192
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant
      system:
        - system
If I remember correctly, I tried this and it didn't work for me, but it's possible I failed to rebuild the package before retrying. Regardless, for my next runs I'm likely going to stick with the script I have that prepares my data in type: input_output format, since I know that works. I don't really use the --gradio option to test the model at the end; I've started to default to spinning up vLLM, and vLLM seems to apply the chat template correctly. So I have a workaround, but I wanted to put this issue out there so others are aware and maybe eventually we can get it fixed.
Since #1818, the max_length is set to the sequence_len parameter.
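A minimal sketch of that fallback, assuming the post-#1818 behavior is simply "use the per-dataset max_length if set, otherwise sequence_len" (resolve_max_length is a hypothetical name for illustration, not the axolotl API):

```python
def resolve_max_length(dataset_max_length, sequence_len):
    # Hedged sketch: prefer an explicit per-dataset max_length;
    # otherwise fall back to the top-level sequence_len.
    return dataset_max_length if dataset_max_length is not None else sequence_len

print(resolve_max_length(None, 8192))  # 8192, the behavior the reporter expected
```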