Watermarking a Prompt
mckunkel opened this issue · 1 comments
Is it possible to just watermark the input?
Say I want to see if I can watermark a prompt such as:
"You are an expert ML system engineer. How do I integrate X with Y"
to which I would get a similar prompt back, just watermarked?
thanks
Your request sounds like asking to add a watermark to existing text. That isn't the setting under which we developed and tested our watermark; however, one possible way to do it is to formulate it as a paraphrasing task. You prepare a "paraphrase prompt" and provide it to your model along with the actual text you want to watermark. What the model produces should be a rewritten version of the prompt you gave it, but with some amount of watermark signal added.
That said, most open-source base language models, and even instruction-tuned ones, are not amazing paraphrasers. So you might want to use (or train) a model specifically for paraphrasing, and as long as it has a "next token prediction" language modeling head, our watermark can be added to it. A rough sketch of this setup is below.
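Here is a minimal sketch using the Hugging Face `generate` API. The `WatermarkLogitsProcessor` import and its arguments are modeled on the example in this repo's README, and the model name and paraphrase instruction are placeholders you'd swap for your own, so treat this as a starting point rather than a tested recipe.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, LogitsProcessorList
# Assumes the watermark processor module from this repo is on your path (see the README example).
from extended_watermark_processor import WatermarkLogitsProcessor

# Placeholder: swap in any instruction-tuned or paraphrasing-capable causal LM.
model_name = "your-paraphraser-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

watermark_processor = WatermarkLogitsProcessor(
    vocab=list(tokenizer.get_vocab().values()),
    gamma=0.25,          # fraction of the vocabulary placed on the "green list"
    delta=2.0,           # logit bias added to green-list tokens
    seeding_scheme="selfhash",
)

# The text you want to watermark becomes the input to a paraphrasing prompt.
original_prompt = "You are an expert ML system engineer. How do I integrate X with Y?"
paraphrase_prompt = (
    "Paraphrase the following text, preserving its meaning:\n\n"
    f"{original_prompt}\n\nParaphrase:"
)

inputs = tokenizer(paraphrase_prompt, return_tensors="pt").to(model.device)
output_tokens = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    logits_processor=LogitsProcessorList([watermark_processor]),
)

# Decode only the newly generated tokens: the watermarked paraphrase of the original prompt.
new_tokens = output_tokens[:, inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens[0], skip_special_tokens=True))
```

The watermark strength in the paraphrase can then be checked with the detector the same way as for any other generated text.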
Keep in mind that this is also an example of a type of security threat we'd generally call a "spoofing" attack: making un-watermarked text appear as if it were generated with the watermark. How well this works for our watermark, as well as for the many other watermarking strategies that have been proposed, is an important area of future research! If you end up running some tests, consider reporting the results to the community 🙂