Potential max_input_len Issue/Inconsistency?
williambarberjr opened this issue · 4 comments
Please check that this issue hasn't been reported before.
- I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I expect input/output pairs whose combined token length (after the chat template is applied) is less than 8192 to be retained in full as valid training data when examining the arrow files under /last_run_prepared/ after running python -m axolotl.cli.preprocess instruct-lora-8b.yml
Current behaviour
Training data is being cut off at a max length of 2048.
Steps to reproduce
My yml sets sequence_len: 8192, but the logs keep printing max_input_len as having been set to 2048. Even when I alter the source code in src/axolotl/utils/trainer.py to hard-code max_input_len: 7192, and change def drop_long_seq(sample, sequence_len=2048, min_sequence_len=2) to def drop_long_seq(sample, sequence_len=7192, min_sequence_len=2), and the log printout confirms that max_input_len has been set to 7192 after rebuilding/reinstalling axolotl, the training data still gets cut off at a length of 2048 tokens. The issue occurs consistently when using these datasets/chat_template settings:
chat_template: llama3
datasets:
  - path: williambarberjr/L3_8B_Instruct_MarkdownToSummaryConvert
    type: chat_template
    chat_template: llama3
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant
      system:
        - system
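For context, the drop_long_seq function mentioned above can be sketched roughly as follows. This is an assumption based on the signature quoted in this report, not the actual axolotl implementation: a filter of this shape would keep only samples whose tokenized length fits the configured bounds, which is consistent with the 2048-token cut-off described.

```python
def drop_long_seq(sample, sequence_len=2048, min_sequence_len=2):
    # Hedged sketch: keep a sample only if its tokenized length falls
    # within [min_sequence_len, sequence_len]. Body is an assumption;
    # only the signature comes from the report above.
    return min_sequence_len <= len(sample["input_ids"]) <= sequence_len

print(drop_long_seq({"input_ids": [0] * 4096}))  # False with the 2048 default
```

With a default of 2048 baked in at the call site, a sample of 4096 tokens would be dropped (or truncated upstream) regardless of sequence_len: 8192 in the yml.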
However, when I use my own custom setup like this:

datasets:
  - path: ./cleanAxolotlTrainingDataL3_8B_Instruct.jsonl
    type: input_output

with the jsonl data already prepped with all the correct chat-template tokens (beginning, ending, etc.), the data is not cut off at a max of 2048. Here's how I'm printing out the prepared data to check whether the template looks correct. First I run python -m axolotl.cli.preprocess instruct-lora-8b.yml on the command line, then I run this Python code:
import json, yaml
import os
from transformers import AutoTokenizer
from datasets import load_from_disk

# find the yml file in the current directory
ymlFile = next(f for f in os.listdir('.') if f.endswith('.yml'))
with open(ymlFile, 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)

# get the subdirectory under last_run_prepared
subdirToLoad = next(subdir for subdir in os.listdir('last_run_prepared') if os.path.isdir(os.path.join('last_run_prepared', subdir)))
ds = load_from_disk(f'last_run_prepared/{subdirToLoad}')
with open('chatTemplateExample.txt', 'w') as f:
    f.write(tok.decode(ds['input_ids'][0]))
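A quick way to confirm where the cut-off lands is to check the longest tokenized example in the prepared dataset. The helper name below is mine; it would operate on the ds['input_ids'] column loaded by the script above (a toy list stands in here):

```python
def longest_example(input_ids_column):
    """Length of the longest tokenized example in the column."""
    return max(len(ids) for ids in input_ids_column)

# Toy stand-in for ds['input_ids']; with the bug described above,
# the real column tops out at 2048 despite sequence_len: 8192.
toy_column = [[1] * 2048, [1] * 731, [1] * 1500]
print(longest_example(toy_column))  # 2048
```

If every example caps out at exactly 2048, truncation (rather than unusually short source data) is the likely cause.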
Thoughts on what might be causing this?
Config yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
chat_template: llama3
datasets:
  - path: williambarberjr/axolotlTrainingDataL3_8B_Instruct.jsonl
    type: chat_template
    chat_template: llama3
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant
      system:
        - system
dataset_prepared_path:
val_set_size: 0.1
output_dir: ./lora-out
data_seed: 49
seed: 49
sequence_len: 8192
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
- embed_tokens
- lm_head
wandb_project: markdownToSummaryLoraAllExamples
wandb_entity: williambarberjr
wandb_watch: gradients
wandb_name: instruct_lora_L3_8B_adamw_all_ex
wandb_log_model: checkpoint
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_8bit
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: .00000001
lr_scheduler: constant
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:
auto_resume_from_checkpoints: true
warmup_steps: 10
eval_batch_size: 2
eval_max_new_tokens: 128
save_total_limit: 3
save_steps: 100
eval_steps: 100
early_stopping_patience: 3
weight_decay: 0.0
special_tokens:
  bos_token: <|begin_of_text|>
  eos_token: <|eot_id|>
  pad_token: <|eot_id|>
Possible solution
I tried several of the ideas above, including hard-coding some variables, to no avail. For whatever reason, the custom input_output setup described above doesn't reproduce the cut-off issue.
Which Operating Systems are you using?
- Linux
- macOS
- Windows
Python Version
Python 3.10.14
axolotl branch-commit
main
Acknowledgements
- My issue title is concise, descriptive, and in title casing.
- I have searched the existing issues to make sure this bug has not been reported yet.
- I am using the latest version of axolotl.
- I have provided enough information for the maintainers to reproduce and diagnose the issue.
We had the same issue: it appears that max_length is somehow hardcoded and does not pick up the value set in the yml file. Changing the value resolved the issue.
@williambarberjr you could probably pass max_length: 8192 in the yml file:
datasets:
  - path: williambarberjr/L3_8B_Instruct_MarkdownToSummaryConvert
    type: chat_template
    chat_template: llama3
    max_length: 8192
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant
      system:
        - system
If I remember correctly, I tried this and it didn't work for me, but it's possible I failed to rebuild the package before retrying. Regardless, for my next runs I'm likely going to stick with the script I have that prepares my data in type: input_output format, since I know that works. I don't really use the --gradio option to test the model at the end; I've started to default to spinning up vLLM, and vLLM seems to apply the chat template correctly. So I have a workaround, but I wanted to put this issue out there so others are aware and maybe eventually we can get it fixed.
Since #1818, the max_length is set to the sequence_len parameter.
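A minimal sketch of that fallback, assuming the post-#1818 behavior is simply "use the per-dataset max_length if set, otherwise sequence_len" (resolve_max_length is a hypothetical name for illustration, not the axolotl API):

```python
def resolve_max_length(dataset_max_length, sequence_len):
    # Hedged sketch: prefer an explicit per-dataset max_length;
    # otherwise fall back to the top-level sequence_len.
    return dataset_max_length if dataset_max_length is not None else sequence_len

print(resolve_max_length(None, 8192))  # 8192, the behavior the reporter expected
```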