huggingface/autotrain-advanced

Do we need to include the BOS and EOS tokens in train.csv?

Closed this issue · 32 comments

When fine-tuning an LLM using train.csv, does each sample require the full template, including the BOS and EOS tokens?

For example, if the model's bos_token is <s>, do I need to include it in the train.csv samples as well?

@abhishekkrthakur, for example, for Mistral 7B Instruct v0.3,
will each example in train.csv look something like this in the text column:
<s>[INST] hi this is user[/INST] this is assistant </s>

Is this right? Each sample will include the entire string with the chat template already applied, right?

This is for the case when chat_template is set to null in the .yml file.
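For context, the formatted string above is what you would get by pre-applying the chat template yourself, along these lines (a sketch assuming the Mistral v0.3 tokenizer; the exact output depends on the model's template):

from transformers import AutoTokenizer

# sketch: pre-apply the chat template to produce the fully formatted training text
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "user", "content": "hi this is user"},
    {"role": "assistant", "content": "this is assistant"},
]
text = tok.apply_chat_template(messages, tokenize=False)
print(text)  # roughly '<s>[INST] hi this is user[/INST] this is assistant</s>'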

you don't need to format the dataset if you are using a chat template; just keep it in JSON format, for example like the no_robots dataset. if your dataset is plain text and you are training a chat model, you need to add the special tokens and tags yourself. there is a parameter that can add the end token, by the way.

@abhishekkrthakur, I am currently using plain text. However, my plain text already contains all the special tokens and tags from pre-applying the chat template myself. Is this approach OK?

So for example, one of my samples inside train.csv could be:
<s>[INST] hi this is user[/INST] this is assistant </s>, which already has the chat template applied beforehand. Just double-confirming that this is OK for AutoTrain? So my train.csv will contain plain text in the text column that already has the chat template applied.

yes. in that case, make sure chat_template is set to none.

OK, thanks for the swift reply. I was trying to re-confirm so I don't end up with the BOS token being applied again on top of my train.csv during fine-tuning.

@abhishekkrthakur however, I realized that when the tokenizer encodes the plain text, I think it will automatically add the BOS token again. Is that right? In this case, do I need to remove the BOS token from train.csv and add it back during inference?

How I know this: when the tokenizer encodes the plain text and I decode it back, the BOS token has been added automatically, which results in two BOS tokens.

@abhishekkrthakur for example:

messages = '''<s>[INST] You are an expert Python programmer. Your task is to do this.[/INST]
assistant goes here
</s>'''

tokenizer(messages) <-- this will prepend <s> again, on top of the <s> already in the messages string.

By doing this, the <s> is applied again. So I was wondering whether AutoTrain will do this to my plain text, or should my plain text include all special tags EXCEPT the BOS token? I cannot seem to find any info on this in the source code. Would kindly appreciate your advice.
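A quick round-trip check reproduces the duplication (a sketch, assuming the Mistral-7B-Instruct-v0.3 tokenizer):

from transformers import AutoTokenizer

# sketch: the Mistral v0.3 tokenizer prepends a BOS token by default,
# so a string that already contains <s> ends up with it twice
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

already_templated = "<s>[INST] hi this is user[/INST] this is assistant </s>"
ids = tok(already_templated).input_ids
print(tok.decode(ids))  # decoded text begins with two <s> tokens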

choose llm generic and disable the option to add the end token, and it will be fine.

@abhishekkrthakur Could you help with this? Just re-clarifying. This is my original .yml file below. Do I change the task in the first line from llm-sft to llm, then insert add_eos_token: false under the data section, and also add trainer: default under the data section? Did I leave anything out? I am using LoRA too.

Will not using sft cause any problems, since my task is actually SFT?

task: llm-sft
base_model: /scratch/xxx
project_name: xxx
log: none
backend: local

data:
  path: /home/xxx
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text

params:
  block_size: 4096
  model_max_length: 4096
  epochs: 20
  batch_size: 4 
  lr: 1e-4
  peft: true
  quantization: int4
  target_modules: "q_proj,v_proj,o_proj,k_proj,gate_proj,down_proj,up_proj" 
  padding: right
  optimizer: adamw_torch
  scheduler: cosine
  gradient_accumulation: 16 
  mixed_precision: bf16         
  warmup_ratio: 0.1
  weight_decay: 0.1
  lora_r: 16
  lora_alpha: 16
  lora_dropout: 0
  merge_adapter: false
  use_flash_attention_2: true  
  logging_steps: 1
  unsloth: false
  seed: 42

@abhishekkrthakur Because there are no generic-trainer example configs, I am not 100% sure about this. It would be really helpful if you could clarify.

it's just llm. if you remove the sft and use only llm, or add trainer: default, it's generic training.

i'll add a config :) thanks for letting me know

it's just llm. if you remove the sft and use only llm, or add trainer: default, it's generic training.

Sorry @abhishekkrthakur, you mean I can simply change the task from llm-sft to llm and it should be OK? I believe I also need to add add_eos_token: false under the params section, is that right?

Side note: does that mean if I use llm-sft as I did originally, I do not have to add the BOS and EOS tokens, since they will be applied during fine-tuning? It's a bit contradictory, because earlier you mentioned we need to add the special tokens to the plain text.

Let's say I want to do supervised fine-tuning with plain text, where the chat template has already been applied, as mentioned in the comment above.

You are saying I can use the generic llm trainer for this task instead of llm-sft? What's the difference between the two?

@abhishekkrthakur would greatly appreciate your reply on this, thanks a ton

i'll add an example for your use case and update here asap :)

@abhishekkrthakur thanks, I hope my problem was clear.

I have plain text (just a string) that already has the chat template applied, which means it includes all the special tokens and tags. Things like the BOS and EOS tokens are already present within this plain text.

I do not want a duplicate BOS (or EOS) token applied during the fine-tuning process using AutoTrain (llm-sft), because the tokenizer will automatically prepend the BOS token during fine-tuning, resulting in double BOS tokens.

Wondering whether the generic llm trainer you mentioned can handle this problem.

Thanks for the help! Looking forward to the update 👍🏻🙏🏻

@abhishekkrthakur do you have any updates on this? I just want to ensure that I am taking the correct approach with your awesome package for my use case (plain text in train.csv that already has the full chat template applied). Thank you so much.

unfortunately, i didn't get a chance to look deeper into it yet. but i will do it and update here as soon as possible. thank you for your patience.

@abhishekkrthakur any updates on this? I just need to know whether your generic trainer will automatically add special tokens (e.g. the BOS token) when tokenizing the dataset.

tokenizer(text, add_special_tokens=False).input_ids
For instance, setting add_special_tokens to False will not add the BOS token to the plain text. Does your generic trainer leave add_special_tokens at its default of True?

It seems like it doesn't set it explicitly, because you have this in your utils:

def tokenize(examples, tokenizer, config):
    output = tokenizer(examples[config.text_column])
    return output

and you simply tokenize the dataset's plain text, so the tokenizer will automatically add the BOS token.
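A quick local check of the difference (a sketch, assuming the Mistral v0.3 tokenizer):

from transformers import AutoTokenizer

# sketch assuming the Mistral v0.3 tokenizer, which prepends a BOS token by default
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

default_ids = tok("hi this is user").input_ids
no_special_ids = tok("hi this is user", add_special_tokens=False).input_ids

print(default_ids[0] == tok.bos_token_id)     # True: BOS is prepended by default
print(no_special_ids[0] == tok.bos_token_id)  # False: no BOS with add_special_tokens=False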

@abhishekkrthakur hi, any updates on this? thanks

hi. the best way as of now is to use the SFT Trainer without the chat template. you can have all the data in the text column and format it the way you want. the same works for the generic trainer, where you have an extra option, add_eos_token, which you can set to false in case you are formatting the data yourself.

@abhishekkrthakur, I have explained all the confusion above, and you still did not address my question... I am just repeating myself over and over.

Let's say I am using the generic trainer and set add_eos_token to false... so I assume I should add all the data in the text column except for the <s>, right? Because the tokenizer will add the BOS token automatically.

the eos token is added at the end of each sample, so that depends on your block_size. if add_eos_token is true, each sample will end with whatever the eos token is in the tokenizer.

only relevant code is here:

The tokenizer is called like this:

def tokenize(examples, tokenizer, config):
    output = tokenizer(examples[config.text_column])
    return output
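
for illustration, appending the eos per sample boils down to something like this (just a rough sketch of the idea, not the exact AutoTrain code):

def append_eos(batch, tokenizer, text_column="text"):
    # rough sketch: append the tokenizer's EOS string to every text sample
    return {text_column: [t + tokenizer.eos_token for t in batch[text_column]]}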

Hi @abhishekkrthakur, thanks for the reply. I understand you might be swamped with work, so it's easy to miss the question, so let me be explicit again:

I am NOT talking about the EOS token. I am talking about the BOS token, <s> in my case. When the script tokenizes my samples in the text column, the BOS token is automatically prepended at the FRONT.

As such, the samples in the text column should contain EVERYTHING except the BOS token. <-- This is what I am clarifying; I would appreciate it if you could let me know whether that statement is true or false.

In other words, when using the generic trainer, my text column should be:
messages = '''[INST] You are an expert Python programmer. Your task is to do this.[/INST]
assistant goes here
'''

and NOT:
messages = '''<s>[INST] You are an expert Python programmer. Your task is to do this.[/INST]
assistant goes here
</s>'''

right?

No. The tokenizer won't add the BOS token. Apologies for the oversight.

@abhishekkrthakur, thanks for your reply.

Hence I am wondering why that is / where it is explicitly stated that the BOS token will not be added when the text passes through the tokenizer? I would appreciate it if you could direct me to that information.

because in the utils.py script:

def tokenize(examples, tokenizer, config):
    output = tokenizer(examples[config.text_column])
    return output

when the examples in the text column are tokenized, my model's tokenizer prepends the BOS token automatically. I have tested this locally via tokenizer = AutoTokenizer.from_pretrained(model_id), and my string has <s> prepended when passed through the tokenizer.

In [11]: tok.decode([21017, 5524, 25, 1680, 345, 3551, 257, 1790, 9793, 546, 262, 23082, 286, 262, 3381, 366, 2144, 404, 1559, 88, 1, 287, 12446, 30, 4222, 779, 6096, 3519, 284, 2785, 158
    ...: 48, 1559, 444, 287, 262, 10515, 1910, 290, 21729, 5981, 2267, 13, 21017, 15286, 25, 366, 9069, 404, 1559, 88, 1, 10229, 284, 257, 1910, 4645, 810, 612, 318, 691, 530, 17872, 329,
    ...:  257, 1948, 922, 393, 2139, 13, 554, 12446, 11, 428, 3381, 318, 3573, 5981, 287, 262, 4827, 1910, 11, 810, 257, 15848, 1559, 88, 9749, 468, 2383, 1176, 625, 262, 9400, 290, 1762,
    ...:  3403, 286, 511, 4409, 13, 383, 4931, 286, 257, 15848, 1559, 88, 460, 1255, 287, 2793, 9400, 290, 5322, 7184, 6443, 329, 3259, 11, 355, 262, 9749, 468, 1310, 15660, 284, 2620, 94
    ...: 00, 393, 2148, 1365, 1762, 3403, 13, 198, 198, 26446, 2267, 468, 5174, 2785, 15848, 1559, 444, 287, 11798, 884, 355, 6308, 290, 3049, 2057, 11, 810, 257, 1178, 1588, 2706, 1630,
    ...: 257, 2383, 6903, 286, 262, 1910, 357, 33, 452, 641, 1222, 14136, 2978, 11, 2211, 737, 554, 777, 11798, 11, 3259, 1690, 1986, 1877, 9400, 11, 3614, 4034, 11, 290, 5322, 23189, 117
    ...: 6, 11, 3756, 284, 257, 3074, 810, 484, 389, 10795, 319, 262, 9749, 329, 511, 30489, 13, 770, 21403, 460, 1255, 287, 2252, 22711, 286, 9400, 290, 257, 7794, 287, 1762, 3403, 13, 1
    ...: 98, 198, 16350, 11, 262, 3721, 286, 15848, 1559, 88, 318, 6393, 284, 4547, 262, 17262, 286, 4827, 5939, 290, 262, 2928, 286, 1910, 1176, 319, 3259, 13, 7735, 2267, 318, 2622, 284
    ...: , 1833, 262, 6287, 290, 2928, 286, 15848, 1559, 444, 319, 262, 3773, 290, 284, 1205, 4788, 284, 2209, 428, 2071, 13, 198, 198, 19927, 25, 198, 33, 452, 641, 11, 449, 1539, 1222,
    ...: 14136, 2978, 11, 406, 13, 357, 6390, 737, 383, 7119, 286, 26040, 8393, 315, 1083, 290, 11302, 43793, 874, 355, 21259, 286, 371, 658, 287, 5849, 352, 22512, 554, 8988, 13, 4913, 2
    ...: 86, 11279, 29845, 1083, 11, 2681, 7, 18, 828, 7632, 12, 3695, 13, 21017, 5524, 25, 2735, 4727, 340, 284, 257, 3290, 50256])

Out[11]: '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.### Human: Now explain it to a dog<|endoftext|>'

The input is the first sample from the guanaco dataset tokenized by AutoTrain, with add_eos_token set to true. As we can see, the BOS token is not added by the tokenizer.

Another example:

In [12]: from transformers import AutoTokenizer

In [13]: tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

In [14]: tok("hi, how are you?")
Out[14]: {'input_ids': [5303, 11, 703, 389, 345, 30], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [15]: tok.decode([5303, 11, 703, 389, 345, 30])
Out[15]: 'hi, how are you?'

As we can see, neither EOS nor BOS tokens are appended.

(screenshot: the user's tokenizer output, showing the BOS token being added)

As per the image above, my tokenizer is behaving differently.
As such, should I just stick with whatever the respective model's tokenizer does?

i checked gpt2. let me check mistral and come back to you. it might be due to different configs.

yes. it seems to be coming from the config:

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/blob/main/tokenizer_config.json#L2
(screenshot of tokenizer_config.json showing add_bos_token set to true)

I can disable adding both the EOS and BOS tokens in AutoTrain to make it consistent across all models. Does that work?
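
for reference, the difference can be checked directly; a quick sketch (the comments describe the expected behaviour):

from transformers import AutoTokenizer

# quick sketch: compare whether each tokenizer prepends a BOS token by default
mistral = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
gpt2 = AutoTokenizer.from_pretrained("openai-community/gpt2")

print(getattr(mistral, "add_bos_token", None))  # expected True, per its tokenizer_config.json
print(getattr(gpt2, "add_bos_token", None))     # gpt2 does not prepend a BOS by default

print(mistral("hi, how are you?").input_ids[0] == mistral.bos_token_id)  # True: BOS prepended
print(gpt2("hi, how are you?").input_ids[0] == gpt2.bos_token_id)        # False: no BOS added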

@abhishekkrthakur, thank you so much for clarifying everything. This is clear now. In other words, when using certain models such as Mistral, I should not include the BOS token <s> in my text column, since it will be added by their tokenizer in AutoTrain. (Please correct me again if I am wrong here.)

regarding AutoTrain, I believe:

  1. if chat_template = null and the user wishes to train with some kind of instruction template, they should pre-apply the chat template to their dataset beforehand
  2. this dataset, pre-applied with the chat template, then goes into the text column for AutoTrain. This means the samples in the text column include EVERYTHING, including BOS, EOS, and so on; the users handle this fully themselves. As such, the tokenizer in AutoTrain should not add further special tokens to the text column.
  3. it may be good to allow users to set both add_eos_token and add_bos_token in the AutoTrain config, not just add_eos_token

Then again, I am not sure whether disabling the addition of both EOS and BOS tokens in AutoTrain would be better. These are just my 2 cents ^

yes you are right.

we have the add_eos_token param to add the token only at the sample end.
i think this is just a special case, but i have to take a deeper look. i could not find add_bos_token in the llama tokenizer config, for example: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/tokenizer_config.json

so, adding an add_bos_token parameter and disabling the addition by the model itself could work.
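
roughly, the idea would look something like this (only a sketch of the approach, assuming a batched dataset map; not the final implementation):

def tokenize(examples, tokenizer, config):
    # sketch: disable the model's own special-token insertion and let config flags
    # decide whether BOS/EOS are added (add_bos_token here is the proposed new flag)
    output = tokenizer(examples[config.text_column], add_special_tokens=False)
    if getattr(config, "add_bos_token", False) and tokenizer.bos_token_id is not None:
        output["input_ids"] = [[tokenizer.bos_token_id] + ids for ids in output["input_ids"]]
        output["attention_mask"] = [[1] + mask for mask in output["attention_mask"]]
    if getattr(config, "add_eos_token", False) and tokenizer.eos_token_id is not None:
        output["input_ids"] = [ids + [tokenizer.eos_token_id] for ids in output["input_ids"]]
        output["attention_mask"] = [mask + [1] for mask in output["attention_mask"]]
    return output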