huggingface/autotrain-advanced

[BUG] Running sentence-transformers:triplet task with local data will fail with exception: KeyError: 'target'

Closed this issue · 6 comments

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config train.yml

UI Screenshots & Parameters

No response

Error Logs

[Screenshot of the error log: KeyError: 'target']

Additional Information

The default value of target_column is set to 'target' in autotrain/trainers/sent_transformers/params.py.
However, the triplet task does not need a target_column, so the training data has no target column in its schema.

In this case, SentenceTransformersPreprocessor.prepare_columns raises a KeyError, because target_column still has its default value 'target' and no column with that name exists in the data.


To fix this bug, set the default value of target_column to None.

By the way, the default value of sentence3_column has the same bug.
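The failure mode and the proposed fix can be reproduced without AutoTrain. The following is a minimal sketch, not the actual AutoTrain code: `Params`, `prepare_columns`, the default values other than 'target', and the internal column names are hypothetical stand-ins that only mimic the behaviour described in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Params:
    # Hypothetical stand-in for the trainer params; only the defaults matter here.
    sentence1_column: str = "sentence1"
    sentence2_column: str = "sentence2"
    sentence3_column: Optional[str] = "sentence3"
    target_column: Optional[str] = "target"  # the default reported in this issue


def prepare_columns(row: dict, params: Params) -> dict:
    """Copy mapped columns to fixed internal names, skipping unmapped (None) ones."""
    mapping = {
        "internal_sentence1": params.sentence1_column,
        "internal_sentence2": params.sentence2_column,
        "internal_sentence3": params.sentence3_column,
        "internal_target": params.target_column,
    }
    # row[old] raises KeyError when a mapped column is absent from the data.
    return {new: row[old] for new, old in mapping.items() if old is not None}


# A triplet record has no target column.
row = {"anchor": "a", "positive": "p", "negative": "n"}

# target_column keeps its default 'target', so the lookup fails.
buggy = Params("anchor", "positive", "negative")
try:
    prepare_columns(row, buggy)
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

# With target_column set to None, the missing column is never looked up.
fixed = Params("anchor", "positive", "negative", target_column=None)
prepared = prepare_columns(row, fixed)
```

In this sketch a None default makes the column optional, which is the behaviour the triplet task needs.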

What does your dataset look like? Can you post the first few lines?


@abhishekkrthakur
Here are the first few lines of my dataset:

{"anchor": "黑缘粗角肖叶甲触角有多大?", "positive": "触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。", "negative": "体长卵形,棕红色;鞘翅棕黄或淡棕色,外缘和中缝黑色或黑褐色;触角基部3、4节棕黄,余节棕色。"}
{"anchor": "黑缘粗角肖叶甲触角有多大?", "positive": "触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。", "negative": "头部刻点粗大,分布不均匀,头顶刻点十分稀疏;触角基部的内侧有一个三角形光瘤,唇基前缘呈半圆形凹切。"}
{"anchor": "黑缘粗角肖叶甲触角有多大?", "positive": "触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。", "negative": "前胸背板横宽,宽约为长的两倍,侧缘敞出较宽,圆形,敞边与盘区之间有一条细纵沟;盘区刻点相当密,前半部刻点较大于后半部。"}
{"anchor": "黑缘粗角肖叶甲触角有多大?", "positive": "触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。", "negative": "小盾片舌形,光亮,末端圆钝。"}
{"anchor": "黑缘粗角肖叶甲触角有多大?", "positive": "触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。", "negative": "鞘翅刻点粗大,不规则排列,肩部之后的刻点更为粗大,具皱褶,近中缝的刻点较小,略呈纵行排列。"}
{"anchor": "黑缘粗角肖叶甲触角有多大?", "positive": "触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。", "negative": "前胸前侧片前缘直;前胸后侧片具粗大刻点。"}

And here is my training config:

task: sentence-transformers:triplet
base_model: /mnt/bn/query-rewrite/autotrain/Alibaba-NLP/gte-large-en-v1.5
project_name: gte-large-en-v1-5-st-triplet-local-dataset
log: tensorboard
backend: local

data:
  path: /mnt/bn/query-rewrite/autotrain/data/datas # this must be the path to the directory containing the train and valid files
  train_split: train # this is the name of the train file (csv or jsonl)
  valid_split: test # this is the name of the valid file (csv or jsonl), optional
  column_mapping:
    sentence1_column: anchor
    sentence2_column: positive
    sentence3_column: negative

params:
  max_seq_length: 8192
  epochs: 1
  batch_size: 8
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

Okay, I was able to reproduce the error.
A quick fix is to add target_column: null to column_mapping. That way you should be able to train the model.
In the meantime, I'm taking a look at how to fix this issue properly.
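As a rough illustration of why the workaround helps: assuming the user's column_mapping is laid over the trainer defaults (an assumption about the merge, not confirmed AutoTrain internals), an explicit null replaces the default 'target', so no missing column is looked up. `DEFAULTS` and `missing_columns` are hypothetical names used only for this check.

```python
import json

# Assumed defaults; only target_column = "target" is confirmed by the report.
DEFAULTS = {
    "sentence1_column": "sentence1",
    "sentence2_column": "sentence2",
    "sentence3_column": "sentence3",
    "target_column": "target",
}


def missing_columns(first_jsonl_line: str, column_mapping: dict) -> list:
    """Return mapped column names that are absent from the first data record."""
    record = json.loads(first_jsonl_line)
    # Explicit entries (including None, i.e. YAML null) override the defaults.
    effective = {**DEFAULTS, **column_mapping}
    return [
        col for col in effective.values()
        if col is not None and col not in record
    ]


line = '{"anchor": "q", "positive": "p", "negative": "n"}'
mapping = {
    "sentence1_column": "anchor",
    "sentence2_column": "positive",
    "sentence3_column": "negative",
}

# Without the workaround the default 'target' is still expected in the data.
without_workaround = missing_columns(line, mapping)

# target_column: null in YAML loads as None and disables the lookup.
with_workaround = missing_columns(line, {**mapping, "target_column": None})
```

Running a check like this against the first line of the train file before launching a job can catch a mapped column that does not exist in the data.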

This is fixed in version 0.8.20 and above.
thank you for reporting this issue :)

closing as fixed