Generate prompt masks from JSONL
harubaru opened this issue · 1 comment
harubaru commented
Currently, the tokenizer accepts plaintext, and when the output is fed through the GPT Finetuner, it optimizes over the entire context. For use cases where optimization should only be performed on a target response, it would be useful to support training data in JSONL format consisting of prompt/completion pairs, so that the prompt can be masked out when calculating loss during training. For example:
{"prompt":"Overjoyed with the new iPhone! ->", "completion":" positive"}
{"prompt":"@lakers disappoint for a third straight night ->", "completion":" negative"}
In the end, the tokenizer should output a `.tokens` and a `.mask` file, where `.mask` contains the masking for the prompts in `.tokens`.
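A minimal sketch of what such a pass could look like, assuming a Hugging Face-style GPT-2 tokenizer and flat binary output files (the file names, dtypes, and `train.jsonl` path here are illustrative, not the finetuner's actual interface):

```python
import json

import numpy as np
from transformers import GPT2TokenizerFast

# Hypothetical paths; the real finetuner may use different names.
JSONL_PATH = "train.jsonl"
TOKENS_PATH = "train.tokens"
MASK_PATH = "train.mask"

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

tokens, mask = [], []
with open(JSONL_PATH) as f:
    for line in f:
        pair = json.loads(line)
        prompt_ids = tokenizer.encode(pair["prompt"])
        completion_ids = tokenizer.encode(pair["completion"])
        tokens.extend(prompt_ids + completion_ids)
        # 0 = excluded from the loss (prompt), 1 = optimized (completion).
        mask.extend([0] * len(prompt_ids) + [1] * len(completion_ids))

# uint16 fits the GPT-2 vocab (50257 ids); larger vocabs would need uint32.
np.array(tokens, dtype=np.uint16).tofile(TOKENS_PATH)
np.array(mask, dtype=np.uint8).tofile(MASK_PATH)
```

During training, the finetuner could then consume `.mask` by zeroing the loss at prompt positions, or by mapping masked positions to an ignore index (e.g. PyTorch's cross-entropy `ignore_index=-100`), so gradients only flow from completion tokens.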
harubaru commented
Moving to coreweave/kubernetes-cloud#182