wbrown/gpt_bpe

Generate prompt masks from JSONL

harubaru opened this issue · 1 comments

Currently, the tokenizer accepts plaintext, and when that output is fed through the GPT finetuner, it optimizes on the entire context. For use cases where optimization should only be performed on a target response, it would be useful to accept training data in JSONL format consisting of request/response pairs, so that the request portion can be masked out when calculating loss during training.

{"prompt":"Overjoyed with the new iPhone! ->", "completion":" positive"}
{"prompt":"@lakers disappoint for a third straight night ->", "completion":" negative"}

Ultimately, the tokenizer should output a .tokens and a .mask file, where .mask marks which positions in .tokens belong to prompts.
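The mask-building step could be sketched roughly as below. This is only an illustration, not the gpt_bpe API: the `Record` struct, `buildMask` function, and the whitespace `tokenize` stand-in (the real library would use BPE token IDs) are all hypothetical names for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Record mirrors one JSONL line of request/response pairs.
type Record struct {
	Prompt     string `json:"prompt"`
	Completion string `json:"completion"`
}

// tokenize is a stand-in for the real BPE tokenizer; it splits on
// whitespace so the example stays self-contained.
func tokenize(s string) []string {
	return strings.Fields(s)
}

// buildMask tokenizes a record and emits a parallel mask slice:
// 0 for prompt tokens (excluded from loss), 1 for completion tokens.
func buildMask(r Record) (tokens []string, mask []byte) {
	for _, t := range tokenize(r.Prompt) {
		tokens = append(tokens, t)
		mask = append(mask, 0) // prompt: masked out of the loss
	}
	for _, t := range tokenize(r.Completion) {
		tokens = append(tokens, t)
		mask = append(mask, 1) // completion: contributes to loss
	}
	return tokens, mask
}

func main() {
	line := `{"prompt":"Overjoyed with the new iPhone! ->","completion":" positive"}`
	var rec Record
	if err := json.Unmarshal([]byte(line), &rec); err != nil {
		panic(err)
	}
	tokens, mask := buildMask(rec)
	fmt.Println(tokens)
	fmt.Println(mask)
}
```

In the actual output files, .tokens would hold the token IDs and .mask the corresponding 0/1 values, position for position.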