[BUG] Running sentence-transformers:triplet task with local data will fail with exception: KeyError: 'target'

Question


Closed this issue 3 months ago · 6 comments

S1yuan commented 3 months ago

Prerequisites

I have read the documentation.
I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config train.yml

UI Screenshots & Parameters

No response

Error Logs

Additional Information

The default value of target_column is set to 'target' in autotrain/trainers/sent_transformers/params.py.

However, the triplet task does not need a target_column, so the training data has no such column in its schema.

In this case, SentenceTransformersPreprocessor.prepare_columns raises a KeyError, because target_column still carries its default value 'target' and that column does not exist in the data.

To fix this bug, set the default value of target_column to None.

By the way, the default value of sentence3_column has the same bug.
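
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual autotrain code: the `Params` class, the `prepare_columns` function, and the output key names are hypothetical stand-ins, and it assumes the preprocessor looks up every configured column name in each record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Params:
    """Hypothetical stand-in for the trainer params (not the real autotrain class)."""
    sentence1_column: str = "sentence1"
    sentence2_column: str = "sentence2"
    sentence3_column: Optional[str] = None
    # Assumption: mirrors the default value 'target' reported in this issue.
    target_column: Optional[str] = "target"


def prepare_columns(row: dict, params: Params) -> dict:
    """Copy the configured columns into fixed keys; columns set to None are skipped."""
    mapping = {
        "sentence1": params.sentence1_column,
        "sentence2": params.sentence2_column,
        "sentence3": params.sentence3_column,
        "target": params.target_column,
    }
    # row[old] raises KeyError when a configured column is absent from the data.
    return {new: row[old] for new, old in mapping.items() if old is not None}


# A triplet record has no target column.
row = {"anchor": "a question", "positive": "a matching passage", "negative": "an unrelated passage"}

# With the default target_column="target", the lookup fails.
buggy = Params(sentence1_column="anchor", sentence2_column="positive", sentence3_column="negative")
try:
    prepare_columns(row, buggy)
    error = None
except KeyError as exc:
    error = exc

# With target_column=None, the missing column is never looked up.
fixed = Params(
    sentence1_column="anchor",
    sentence2_column="positive",
    sentence3_column="negative",
    target_column=None,
)
result = prepare_columns(row, fixed)
```

In this sketch, a default of None means "column not used", which is why changing the default removes the error for tasks that have no target.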

Answer 1 · 2024-09-20T11:17:49.000Z

What does your dataset look like? Can you post the first few lines?

Answer 2 · 2024-09-23T11:53:01.000Z

What does your dataset look like? Can you post the first few lines?

@abhishekkrthakur
Here are the first few lines of my dataset:

{"anchor": "黑缘粗角肖叶甲触角有多大？", "positive": "触角近于体长之半，第1节粗大，棒状，第2节短，椭圆形，3、4两节细长，稍短于第5节，第5节基细端粗，末端6节明显粗大。", "negative": "体长卵形，棕红色；鞘翅棕黄或淡棕色，外缘和中缝黑色或黑褐色；触角基部3、4节棕黄，余节棕色。"}
{"anchor": "黑缘粗角肖叶甲触角有多大？", "positive": "触角近于体长之半，第1节粗大，棒状，第2节短，椭圆形，3、4两节细长，稍短于第5节，第5节基细端粗，末端6节明显粗大。", "negative": "头部刻点粗大，分布不均匀，头顶刻点十分稀疏；触角基部的内侧有一个三角形光瘤，唇基前缘呈半圆形凹切。"}
{"anchor": "黑缘粗角肖叶甲触角有多大？", "positive": "触角近于体长之半，第1节粗大，棒状，第2节短，椭圆形，3、4两节细长，稍短于第5节，第5节基细端粗，末端6节明显粗大。", "negative": "前胸背板横宽，宽约为长的两倍，侧缘敞出较宽，圆形，敞边与盘区之间有一条细纵沟；盘区刻点相当密，前半部刻点较大于后半部。"}
{"anchor": "黑缘粗角肖叶甲触角有多大？", "positive": "触角近于体长之半，第1节粗大，棒状，第2节短，椭圆形，3、4两节细长，稍短于第5节，第5节基细端粗，末端6节明显粗大。", "negative": "小盾片舌形，光亮，末端圆钝。"}
{"anchor": "黑缘粗角肖叶甲触角有多大？", "positive": "触角近于体长之半，第1节粗大，棒状，第2节短，椭圆形，3、4两节细长，稍短于第5节，第5节基细端粗，末端6节明显粗大。", "negative": "鞘翅刻点粗大，不规则排列，肩部之后的刻点更为粗大，具皱褶，近中缝的刻点较小，略呈纵行排列。"}
{"anchor": "黑缘粗角肖叶甲触角有多大？", "positive": "触角近于体长之半，第1节粗大，棒状，第2节短，椭圆形，3、4两节细长，稍短于第5节，第5节基细端粗，末端6节明显粗大。", "negative": "前胸前侧片前缘直；前胸后侧片具粗大刻点。"}

Answer 3 · 2024-09-23T11:58:52.000Z

And here is my training config:

task: sentence-transformers:triplet
base_model: /mnt/bn/query-rewrite/autotrain/Alibaba-NLP/gte-large-en-v1.5
project_name: gte-large-en-v1-5-st-triplet-local-dataset
log: tensorboard
backend: local

data:
  path: /mnt/bn/query-rewrite/autotrain/data/datas # this must be the path to the directory containing the train and valid files
  train_split: train # this is the name of the train file (csv or jsonl)
  valid_split: test # this is the name of the valid file (csv or jsonl), optional
  column_mapping:
    sentence1_column: anchor
    sentence2_column: positive
    sentence3_column: negative

params:
  max_seq_length: 8192
  epochs: 1
  batch_size: 8
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16
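
Before launching a run with a config like the one above, it can help to check that every mapped column actually exists in the data file. The following is a minimal sketch under stated assumptions: `missing_columns` is a hypothetical helper that is not part of autotrain, and the sample record is an invented placeholder that only reuses the column names from this thread.

```python
import json


def missing_columns(jsonl_line: str, column_mapping: dict) -> list:
    """Return the mapped column names that are absent from one JSONL record."""
    record = json.loads(jsonl_line)
    return [
        column
        for column in column_mapping.values()
        if column is not None and column not in record
    ]


# Placeholder record with the same keys as the dataset posted in this thread.
sample = '{"anchor": "a question", "positive": "a matching passage", "negative": "an unrelated passage"}'

# The column mapping exactly as written in the config.
mapping = {
    "sentence1_column": "anchor",
    "sentence2_column": "positive",
    "sentence3_column": "negative",
}
missing_as_configured = missing_columns(sample, mapping)

# The same mapping once the reported default target_column='target' is applied on top.
missing_with_default = missing_columns(sample, {**mapping, "target_column": "target"})
```

Here the mapping as configured has no missing columns, while the mapping with the reported default adds a `target` entry that the data cannot satisfy.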

Answer 4 · 2024-09-23T12:10:40.000Z

Okay, I was able to reproduce the error.
A quick fix is to add target_column: null to column_mapping in the config. That way you should be able to train the model.
In the meantime, I'm looking at how to fix this issue properly.

Answer 5 · 2024-09-23T12:20:16.000Z

This is also fixed in version 0.8.20 and above.
Thank you for reporting this issue :)

Answer 6 · 2024-09-30T08:16:43.000Z

Closing as fixed.