IntelLabs/RAG-FiT

Why is the training not starting when using my own dataset?

Closed this issue · 2 comments

Hello, I have modified the training.yaml file to run on my own dataset. The modifications are as follows:
```yaml
name: "my_training_pipeline"

model:
  _target_: ragfoundry.models.hf.HFTrain
  model_name_or_path: "microsoft/Phi-3-mini-128k-instruct"
  load_in_4bit: false
  load_in_8bit: true
  torch_dtype:            # Can be set to "float16" or "bfloat16", if needed
  device_map: "auto"      # Automatically assign devices, or specify specific devices
  trust_remote_code: true
  lora:
    bias: none
    fan_in_fan_out: false
    layers_pattern:
    layers_to_transform:
    lora_alpha: 16
    lora_dropout: 0.1
    peft_type: LORA
    r: 16
    target_modules:
      - qkv_proj
    task_type: CAUSAL_LM
    use_rslora: true
  completion_start:
  instruction_in_prompt:
  max_sequence_len: 4000

train:
  output_dir: ./rag/RAGFoundry-main/models/finetuned_model
  bf16: false
  fp16: false
  gradient_accumulation_steps: 4
  group_by_length: false        # Enable length grouping if needed
  learning_rate: 2e-5
  logging_steps: 10
  lr_scheduler_type: cosine
  max_steps: -1                 # -1 means using num_train_epochs
  num_train_epochs: 3
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8 # Batch size for validation set
  optim: adamw_torch_fused
  remove_unused_columns: true
  save_steps: 20000
  save_total_limit: 2
  warmup_ratio: 0.03
  weight_decay: 0.001
  report_to: wandb              # Use wandb for logging if needed

validation:
  data_file: mypath/procedure_dev.jsonl
  eval_steps: 1000              # Evaluate every 1000 steps
  max_eval_samples: 1000        # Maximum number of validation samples

instruction: ragfoundry/processing/prompts/prompt_instructions/qa.txt
template:                       # specify a template file or use chatML format with tokenizer's chat template
data_file: mypath/procedure_train.jsonl
input_key: text1                # Assuming the input key in the data is text1
output_key: labels              # Assuming the output key in the data is labels
resume_checkpoint:
limit:
shuffle: true                   # Shuffle the dataset
use_wandb: false                # Set to false if not using wandb
hfhub_tag:
experiment: phi-training
wandb_entity:
logging:
  level: DEBUG                  # Ensure all logs are captured
  handlers:
    - stream                    # Output to console only
```
My dataset is formatted as follows:

```
{"text1": "FIN6 has used Windows Credential Editor for credential dumping.", "labels": ["T1003.001"]}
{"text1": "Ke3chang has dumped credentials, including by using Mimikatz.", "labels": ["T1003.001"]}
```
When I run the command `python processing.py -cp configs/path -cn train`, the output is:

```
input_key: text1
output_key: labels
resume_checkpoint: null
limit: null
shuffle: true
use_wandb: false
hfhub_tag: null
experiment: phi-training
wandb_entity: null
logging:
  level: DEBUG
  handlers:
  - stream

[2024-08-11 23:11:47,629][root][INFO] - Caching state: True
0it [00:00, ?it/s]
```
    It seems like the training did not start.

If you want to run training, you need the training.py module, not processing.py.
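For example, assuming training.py accepts the same Hydra-style `-cp` (config path) and `-cn` (config name) flags you already used with processing.py (the paths below are just the placeholders from your command):

```bash
# Run the training module instead of the processing module,
# pointing it at the same config directory and config name.
python training.py -cp configs/path -cn train
```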

Thank you very much, I'll give it a try.